Statystyka wielowymiarowa

Przemysław Węglik 17.06.2024

Dataset

Użyłem datasetu: Most Streamed Spotify Songs 2023. Dane zawierają różnorodne informacje dotyczące najczęściej odsłuchiwamnych utworów dostępnych na Spotify w roku 2023. Zbiór danych obejmuje zarówno cechy utworów, jak i informacje o artystach oraz interakcjach użytkowników z muzyką. Poniżej znajduje się przykładowy przegląd tego, co zawierają dane:

Informacje o utworach

Zbiór danych zawiera podstawowe informacje o utworach, takie jak identyfikator utworu, jego nazwa, nazwa artysty, nazwa albumu oraz data wydania. Szczególnie interesującą nas wartością jest liczba odsłuchań, którą będziemy próbowali przewidywać różnymi metodami.

Informacje o artystach

Dane o artystach obejmują ich unikalne identyfikatory, nazwy, oraz gatunki muzyczne, które reprezentują.

Cechy audio

Spotify dostarcza wiele danych opisujących cechy audio utworów. Na przykład, “danceability” (czyli jak dobrze utwór nadaje się do tańczenia), “energy” (energia utworu), klucz muzyczny, głośność, tryb (major lub minor), oraz “speechiness” (obecność słów mówionych w utworze). Inne przykłady to “acousticness” (akustyczność), “instrumentalness” (obecność instrumentalnych partii) oraz “liveness” (to czy utwór jest nagrany na żywo).

Regersja liniowa

Ładowanie zbioru danych

Zmieniamy nazwy kolumn i delikatnie czyścimy

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
dataframe = na.omit(dataframe)

head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24

Prosta regresja liniowa

fit_ac <- lm(streams ~ acousticness, data = dataframe)
summary(fit_ac)
## 
## Call:
## lm(formula = streams ~ acousticness, data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -514924519 -371670699 -222444207  157511754 3187110177 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  516784897   26546437  19.467   <2e-16 ***
## acousticness    -97769     707306  -0.138     0.89    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 567100000 on 950 degrees of freedom
## Multiple R-squared:  2.011e-05,  Adjusted R-squared:  -0.001032 
## F-statistic: 0.01911 on 1 and 950 DF,  p-value: 0.8901
fit_lv <- lm(dataframe$streams ~ dataframe$liveness)
summary(fit_lv)
## 
## Call:
## lm(formula = dataframe$streams ~ dataframe$liveness)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -528544084 -367994944 -225320538  143020359 3171353537 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        550517646   30528184  18.033   <2e-16 ***
## dataframe$liveness  -1997346    1339063  -1.492    0.136    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566500000 on 950 degrees of freedom
## Multiple R-squared:  0.002336,   Adjusted R-squared:  0.001286 
## F-statistic: 2.225 on 1 and 950 DF,  p-value: 0.1361
fit_sp <- lm(dataframe$streams ~ dataframe$speechiness)
summary(fit_sp)
## 
## Call:
## lm(formula = dataframe$streams ~ dataframe$speechiness)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -553557137 -368358821 -208002662  153643116 3169601189 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           579247918   26130426  22.168  < 2e-16 ***
## dataframe$speechiness  -6422005    1843079  -3.484 0.000516 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 563600000 on 950 degrees of freedom
## Multiple R-squared:  0.01262,    Adjusted R-squared:  0.01158 
## F-statistic: 12.14 on 1 and 950 DF,  p-value: 0.0005158

Wykresy prostej regresji liniowej

Prosta regresji na tle danych

{
  plot = plot(dataframe$speechiness, dataframe$streams)
  abline(fit_sp, col="red")
}

{
  fit <- lm(dataframe$liveness ~ dataframe$energy)
  plot = plot(dataframe$energy, dataframe$liveness)
  abline(fit, col="red")
}

Wnioski

Liczba odsłuchań jest negatywnie skorelowana z liczbą słów w piosence. - energetyczność piosenki jest pozytywnie skorelowana z elementami wystąpień na żywo

Regresja wielokrotna

fit_some <- lm(streams ~ key + energy + liveness, data = dataframe)
summary(fit_some)
## 
## Call:
## lm(formula = streams ~ key + energy + liveness, data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -614216253 -370069060 -215158836  153375317 3094635971 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  614097404   93535709   6.565 8.58e-11 ***
## keyA        -116820333   88036466  -1.327    0.185    
## keyA#         24779965   95119622   0.261    0.795    
## keyB           1222099   85925559   0.014    0.989    
## keyC#         85096244   77964824   1.091    0.275    
## keyD           8376519   85781441   0.098    0.922    
## keyD#         28828680  114624798   0.252    0.801    
## keyE          55584746   92624018   0.600    0.549    
## keyF         -53002660   83681170  -0.633    0.527    
## keyF#          4296424   88347586   0.049    0.961    
## keyG         -70052102   82091254  -0.853    0.394    
## keyG#        -42406417   83205929  -0.510    0.610    
## energy         -913936    1127324  -0.811    0.418    
## liveness      -1868851    1352587  -1.382    0.167    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 567200000 on 938 degrees of freedom
## Multiple R-squared:  0.01246,    Adjusted R-squared:  -0.001227 
## F-statistic: 0.9103 on 13 and 938 DF,  p-value: 0.5417

Regresja bez niektórych zmiennych

fit_filtered <- lm(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe)
summary(fit_filtered)
## 
## Call:
## lm(formula = streams ~ . - track_name - artist.s._name - released_day - 
##     in_deezer_playlists - in_shazam_charts, data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.508e+09 -1.502e+08 -3.670e+07  1.113e+08  2.208e+09 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.612e+09  2.027e+09  -3.261 0.001149 ** 
## artist_count         -3.454e+07  1.121e+07  -3.082 0.002116 ** 
## released_year         3.416e+06  1.006e+06   3.397 0.000711 ***
## released_month        3.842e+06  2.756e+06   1.394 0.163678    
## in_spotify_playlists  3.566e+04  1.930e+03  18.475  < 2e-16 ***
## in_spotify_charts     3.546e+06  6.991e+05   5.072 4.75e-07 ***
## in_apple_playlists    2.929e+06  1.807e+05  16.211  < 2e-16 ***
## in_apple_charts      -4.344e+05  2.470e+05  -1.759 0.078949 .  
## in_deezer_charts     -6.296e+06  2.145e+06  -2.935 0.003416 ** 
## bpm                  -5.486e+04  3.503e+05  -0.157 0.875573    
## keyA                  2.155e+06  4.627e+07   0.047 0.962869    
## keyA#                 6.729e+07  5.015e+07   1.342 0.180018    
## keyB                  2.290e+07  4.533e+07   0.505 0.613595    
## keyC#                 5.622e+05  4.122e+07   0.014 0.989122    
## keyD                 -1.328e+07  4.486e+07  -0.296 0.767286    
## keyD#                 3.669e+07  6.040e+07   0.607 0.543688    
## keyE                  5.071e+07  4.970e+07   1.020 0.307822    
## keyF                  2.183e+07  4.402e+07   0.496 0.620138    
## keyF#                -2.444e+07  4.673e+07  -0.523 0.601035    
## keyG                 -4.475e+07  4.276e+07  -1.047 0.295569    
## keyG#                 6.018e+06  4.368e+07   0.138 0.890445    
## modeMinor             6.419e+06  2.079e+07   0.309 0.757626    
## danceability         -6.340e+05  7.986e+05  -0.794 0.427473    
## valence              -3.622e+05  5.071e+05  -0.714 0.475234    
## energy               -1.237e+06  7.839e+05  -1.578 0.114940    
## acousticness          7.332e+05  4.769e+05   1.537 0.124563    
## instrumentalness     -1.016e+06  1.157e+06  -0.878 0.379913    
## liveness              3.977e+05  7.101e+05   0.560 0.575609    
## speechiness          -1.312e+06  1.017e+06  -1.290 0.197265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 294100000 on 923 degrees of freedom
## Multiple R-squared:  0.7388, Adjusted R-squared:  0.7309 
## F-statistic: 93.25 on 28 and 923 DF,  p-value: < 2.2e-16
mean(summary(fit_some)$residuals^2)
## [1] 3.169899e+17
mean(summary(fit_filtered)$residuals^2)
## [1] 8.383588e+16

Wniosek

Dołożenie większej liczby cech zmniejszyło średnio błąd kwadratowy kilkukrotnie.

Interakcje między zmiennymi

summary(lm(streams ~ energy * liveness, data = dataframe))
## 
## Call:
## lm(formula = streams ~ energy * liveness, data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -527081102 -371729201 -223431455  144319986 3177832375 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     530659683  125506028   4.228 2.58e-05 ***
## energy             214249    1835562   0.117    0.907    
## liveness          1802778    5998489   0.301    0.764    
## energy:liveness    -52725      83289  -0.633    0.527    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566900000 on 948 degrees of freedom
## Multiple R-squared:  0.003179,   Adjusted R-squared:  2.45e-05 
## F-statistic: 1.008 on 3 and 948 DF,  p-value: 0.3885

Nieliniowe transformacje predyktorów

fit_l2 <- lm(streams ~ bpm + I(bpm^2), data = dataframe)
summary(fit_l2)
## 
## Call:
## lm(formula = streams ~ bpm + I(bpm^2), data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -545793552 -367775793 -224042960  166139165 3157050040 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1068629196  316595991   3.375 0.000767 ***
## bpm           -9045609    5057220  -1.789 0.073990 .  
## I(bpm^2)         35054      19540   1.794 0.073131 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566500000 on 949 degrees of freedom
## Multiple R-squared:  0.003386,   Adjusted R-squared:  0.001286 
## F-statistic: 1.612 on 2 and 949 DF,  p-value: 0.2
anova(fit_ac, fit_l2)
## Analysis of Variance Table
## 
## Model 1: streams ~ acousticness
## Model 2: streams ~ bpm + I(bpm^2)
##   Res.Df        RSS Df  Sum of Sq     F  Pr(>F)  
## 1    950 3.0558e+20                              
## 2    949 3.0455e+20  1 1.0285e+18 3.205 0.07373 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regresja wielomianowa wyższego stopnia może wykorzystywać funkcję poly()

fit_l5 <- lm(streams ~ poly(bpm, 5), data = dataframe)
summary(fit_l5)
## 
## Call:
## lm(formula = streams ~ poly(bpm, 5), data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -647140156 -365012931 -217934211  136175023 3095450018 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.141e+08  1.828e+07  28.120   <2e-16 ***
## poly(bpm, 5)1 -4.262e+07  5.641e+08  -0.076   0.9398    
## poly(bpm, 5)2  1.016e+09  5.641e+08   1.802   0.0719 .  
## poly(bpm, 5)3  7.897e+08  5.641e+08   1.400   0.1619    
## poly(bpm, 5)4 -1.114e+09  5.641e+08  -1.975   0.0486 *  
## poly(bpm, 5)5 -1.274e+09  5.641e+08  -2.258   0.0242 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 564100000 on 946 degrees of freedom
## Multiple R-squared:  0.0148, Adjusted R-squared:  0.009589 
## F-statistic: 2.841 on 5 and 946 DF,  p-value: 0.01484

Logarytmiczna transformacja predyktora

summary(lm(streams ~ log(bpm), data = dataframe))
## 
## Call:
## lm(formula = streams ~ log(bpm), data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -515505699 -371231452 -221469931  161070391 3197405857 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 615939052  383296408   1.607    0.108
## log(bpm)    -21286853   80055568  -0.266    0.790
## 
## Residual standard error: 567100000 on 950 degrees of freedom
## Multiple R-squared:  7.442e-05,  Adjusted R-squared:  -0.0009781 
## F-statistic: 0.0707 on 1 and 950 DF,  p-value: 0.7904

Wnioski

Stosowanie bardziej skomplikowanych przekształceń zwięszka RSE, a zatem jakość predykcji pogarsza się.

Klasyfikacja

Ładowanie zbioru danych

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24
library(class)
library(MASS)

Regresja logistyczna - multinomial

library(nnet) 

model <- multinom(key ~ mode + danceability + valence + energy + acousticness, data = dataframe) 
## # weights:  84 (66 variable)
## initial  value 2368.116037 
## iter  10 value 2325.042927
## iter  20 value 2322.501362
## iter  30 value 2289.789215
## iter  40 value 2273.311980
## iter  50 value 2250.119530
## iter  60 value 2245.724806
## final  value 2245.641183 
## converged
summary(model)
## Call:
## multinom(formula = key ~ mode + danceability + valence + energy + 
##     acousticness, data = dataframe)
## 
## Coefficients:
##    (Intercept)  modeMinor  danceability      valence        energy
## A   1.29606649  1.1214705 -0.0004739765 -0.008824665 -1.833282e-02
## A# -0.82550988  1.3649660  0.0173909802 -0.005349928 -1.317957e-02
## B  -2.50188219  1.5174331  0.0211078471 -0.010757154  1.474683e-02
## C# -0.56168512  0.8019744  0.0211956324 -0.014643921  3.208753e-03
## D  -0.84797771 -0.2253672  0.0233053132 -0.011721178 -1.789800e-03
## D# -1.14176624  1.9424135  0.0015235406 -0.016029464  1.354838e-04
## E  -1.71322646  2.3858951  0.0133787065 -0.032879557  1.006282e-02
## F  -0.83124926  1.3189090  0.0093076635 -0.005296897  6.422626e-05
## F# -2.52137966  1.6311152  0.0062304434  0.006098420  1.023718e-02
## G  -0.01306834  0.4715483  0.0121626795 -0.001725732 -9.276453e-03
## G#  0.72141622  0.4613253  0.0053704694 -0.001717764 -1.210428e-02
##     acousticness
## A  -0.0093504718
## A# -0.0074776880
## B  -0.0016640732
## C# -0.0119998469
## D  -0.0030400454
## D# -0.0008041040
## E   0.0074040615
## F  -0.0009512604
## F#  0.0065047107
## G  -0.0075410260
## G# -0.0134505680
## 
## Std. Errors:
##    (Intercept) modeMinor danceability     valence     energy acousticness
## A     1.127693 0.3484438   0.01187603 0.008001999 0.01252731  0.007560088
## A#    1.274097 0.3705158   0.01327752 0.008629937 0.01363357  0.008316290
## B     1.199332 0.3412472   0.01216630 0.007866253 0.01260662  0.007707144
## C#    1.054866 0.3181438   0.01081752 0.007128288 0.01126190  0.007036789
## D     1.145787 0.3849299   0.01181744 0.007791822 0.01226565  0.007433363
## D#    1.500188 0.4481345   0.01561686 0.010553201 0.01652615  0.009885615
## E     1.257575 0.3888066   0.01288685 0.008869717 0.01374536  0.008085948
## F     1.119000 0.3330552   0.01158835 0.007625619 0.01206428  0.007291477
## F#    1.221157 0.3506954   0.01253501 0.008157870 0.01290945  0.007750776
## G     1.085807 0.3381602   0.01126353 0.007411170 0.01171786  0.007152007
## G#    1.092456 0.3430049   0.01135511 0.007525217 0.01187366  0.007348994
## 
## Residual Deviance: 4491.282 
## AIC: 4623.282

Wnioski

Tryb (mode) mocno koreluje z niektórymi z kluczy, inne pojedyncze cechy również nieco korelują, przydałaby się tutaj normalizacja danych, aby lepiej porównać.

training_pred <- predict(model, dataframe, type="class")

accuracy <- mean(training_pred == dataframe$key)
accuracy
## [1] 0.1511018

Wnioski

Accuracy nie powala, ale klucz może byc czymś bardzo trudnym do przewidzenia - mamy 11 klas, model czegoś się jednak nauczył.

library(caTools)
trainIndex <- sample.split(dataframe$key, SplitRatio = 0.8)
train <- subset(dataframe, trainIndex)
test <- subset(dataframe, !trainIndex)

Regresję wykonujemy na podstawie zbioru uczącego

model2 <- multinom(key ~ mode + danceability + valence + energy + acousticness, data = train) 
## # weights:  84 (66 variable)
## initial  value 1895.983774 
## iter  10 value 1863.977396
## iter  20 value 1859.999453
## iter  30 value 1848.445103
## iter  40 value 1831.596045
## iter  50 value 1798.027160
## iter  60 value 1794.070430
## final  value 1793.866926 
## converged
summary(model2)
## Call:
## multinom(formula = key ~ mode + danceability + valence + energy + 
##     acousticness, data = train)
## 
## Coefficients:
##    (Intercept)   modeMinor danceability      valence       energy acousticness
## A   0.06575718  1.06495427  0.003170283 -0.008995063 -0.004549996 -0.003171147
## A# -1.10776674  1.41517000  0.018149726 -0.007248873 -0.008969794 -0.005833817
## B  -1.98803425  1.40698964  0.021133889 -0.013757719  0.011553600 -0.004464912
## C# -1.37807325  0.84316696  0.026707953 -0.016777222  0.011310846 -0.012467825
## D  -1.20770371 -0.08102582  0.031691509 -0.013433198 -0.004592226 -0.001786508
## D# -0.81974759  2.01047114 -0.001771200 -0.012882544 -0.003837304 -0.003311427
## E  -2.00392420  2.35965625  0.014690998 -0.035410001  0.014592425  0.009491489
## F  -1.53735721  1.25579711  0.011743409 -0.004245761  0.006465475  0.002730664
## F# -2.98189647  1.71646417  0.005506615  0.004932785  0.016547916  0.010229096
## G  -0.02066264  0.28791529  0.020489468 -0.006370421 -0.011185816 -0.012902503
## G#  0.31858356  0.41110765  0.007904632 -0.001923161 -0.008984386 -0.011010900
## 
## Std. Errors:
##    (Intercept) modeMinor danceability     valence     energy acousticness
## A     1.295237 0.3898043   0.01347035 0.008975690 0.01399429  0.008455955
## A#    1.442345 0.4148728   0.01496439 0.009698123 0.01522954  0.009266060
## B     1.341437 0.3816138   0.01361867 0.008856463 0.01401051  0.008622799
## C#    1.218650 0.3564425   0.01236211 0.008069199 0.01271568  0.008066969
## D     1.310895 0.4186807   0.01354546 0.008768568 0.01374315  0.008305389
## D#    1.684605 0.5084480   0.01761437 0.011911359 0.01844870  0.011056641
## E     1.422020 0.4336138   0.01447390 0.009980614 0.01535764  0.008931496
## F     1.279674 0.3735852   0.01315609 0.008599255 0.01350568  0.008147586
## F#    1.390711 0.3954521   0.01417971 0.009239454 0.01446861  0.008655792
## G     1.241141 0.3836586   0.01284034 0.008371558 0.01312379  0.008169692
## G#    1.241791 0.3850781   0.01286565 0.008469311 0.01324551  0.008218378
## 
## Residual Deviance: 3587.734 
## AIC: 3719.734

a otrzymany model wykorzystujemy do predykcji dla danych ze zbioru testowego

testing_pred <- predict(model, test, type="class")

accuracy <- mean(testing_pred == test$key)
accuracy
## [1] 0.1421053

Wnioski

Accuracy nawet się poprawiło bo podziale na zbiór treningowy i uczący, byćmoże trafiliśmy na “łatwy” zbiór testowy.

kNN

train <- subset(dataframe, trainIndex)
test <- subset(dataframe, !trainIndex)
train = na.omit(train)[c("key", "danceability", "valence", "energy", "acousticness")]
test = na.omit(test)[c("key", "danceability", "valence", "energy", "acousticness")]

train_labels <- train$key
test_labels <- test$key

# Remove 'key' column from train and test
train$key <- NULL
test$key <- NULL

knn_result <- knn(train=train, test=test, cl=train_labels, k=3)

confusionMatrix <- table(pred=knn_result, true=test_labels)

accuracy <- sum(diag(confusionMatrix)) / sum(confusionMatrix)
print(accuracy)
## [1] 0.08947368

Wnioski

Gorze wyniki niż dla normalnej klasyfikacji, kNN słabo działa dal tego zbioru danych.

Resampling

Ładowanie zbioru danych

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
library(class)
library(MASS)
head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24

Walidacja krzyżowa

Usuwamy NA

dataframe <- na.omit(dataframe)

Metoda zbioru walidacyjnego

Tworzymy zbiór uczący z połowy dostępnych obserwacji — reszta będzie stanowić zbiór walidacyjny. Dla zapewnienia powtarzalności obliczeń stosujemy funkcję set.seed.

set.seed(1)
n <- nrow(dataframe)
train <- sample(n, n / 2)

Dopasowujemy model liniowy na zbiorze uczącym, następnie obliczamy MSE dla zbioru walidacyjnego.

dataframe_lm <- lm(streams ~ liveness, data = dataframe, subset = train)
validation_set <- dataframe[-train,]
mse <- mean((validation_set$streams - predict(dataframe_lm, validation_set))^2)
mse
## [1] 2.818154e+17

Powtarzamy to samo dla regresji wielomianowej wyższych stopni

for (i in 2:5) {
  dataframe_lm_poly <- lm(streams ~ poly(liveness, degree = i), data = dataframe, 
                     subset = train)
  print(mean((validation_set$streams - predict(dataframe_lm_poly, validation_set))^2))
}
## [1] 2.818228e+17
## [1] 2.822097e+17
## [1] 2.825869e+17
## [1] 2.829181e+17

Wnioski

Ciężko wnioskować liczbę odsłuchań na podstawie tego czy piosenka jest wyknoywana na żywo. A jeśli juz musimy to lepiej użyć niskiego współczynnika wielomianu.

Powtarzamy obliczenia dla innego zbioru walidacyjnego.

set.seed(2)
train <- sample(n, n / 2)
validation_set <- dataframe[-train,]
degree_max <- 5
mse <- rep(0, times = degree_max)
for (i in 1:degree_max) {
  dataframe_lm <- lm(streams ~ poly(liveness, degree = i), data = dataframe, subset = train)
  mse[i] <- mean((validation_set$streams - predict(dataframe_lm, validation_set))^2)
}
mse
## [1] 2.631586e+17 2.631819e+17 2.776190e+17 2.682317e+17 2.801860e+17

Wnioski

Nieco inne wyniki - co było oczekiwane

Otrzymane wyniki można zobrazować na wykresie

plot(mse, xlab = "Stopień wielomianu", ylab = "MSE", type = "b", pch = 20, 
     col = "blue")

Walidacja krzyżowa bez jednego (leave-one-out)

Walidację krzyżową dla uogólnionych modeli liniowych wykonuje funkcja cv.glm() z pakietu boot. Jej argumentem (glmfit) jest obiekt klasy glm, więc jeśli chcemy jej użyć do walidacji zwykłych modeli liniowych, musimy je dopasowywać jako uogólnione modele liniowe (z family = gaussian, co zresztą jest wartością domyślną). Funkcja cv.glm() zwraca listę (zobacz ?cv.glm), której najbardziej interesującą składawą jest delta — wektor o długości 2 zawierający estymatę błędu predykcji w wersji oryginalnej i skorygowaną dla uwzględnienia obciążenia wprowadzanego przez walidację krzyżową inną niż LOOCV.

library(boot)

compute_loocv_mse <- function(degree) {
  dataframe_glm <- glm(streams ~ poly(liveness, degree), data = dataframe)
  cv.glm(dataframe, dataframe_glm)$delta[1]
}
mse <- sapply(1:degree_max, compute_loocv_mse)
mse
## [1] 3.213943e+17 3.217088e+17 3.217095e+17 3.218216e+17 3.222182e+17

Można też narysować obrazek

plot(mse, xlab = "Stopień wielomianu", ylab = "LOOCV MSE", type = "b", pch = 20, 
     col = "blue")

[Co teraz z wnioskami na temat regresji wielomianowej w naszym przypadku?]

MSE jest jeszcze gorsze.

\(k\)-krotna walidacja krzyżowa

Podobnie korzystamy z funkcji cv.glm(), tylko teraz jawnie ustawiamy parametr K oznaczający liczbę grup (folds). Np. dla \(k = 10\) wygląda to jak poniżej.

compute_kcv_mse <- function(degree, k) {
  dataframe_glm <- glm(streams ~ poly(liveness, degree), data = dataframe)
  cv.glm(dataframe, dataframe_glm, K = k)$delta[1]
}
mse <- sapply(1:degree_max, compute_kcv_mse, k = 10)
mse
## [1] 3.212015e+17 3.219672e+17 3.225988e+17 3.225096e+17 3.215037e+17

Oczywiście tym razem wyniki są losowe. Możemy zrobić ich zestawienie dla np. 10 prób.

mse10 <- replicate(10, sapply(1:degree_max, compute_kcv_mse, k = 10))
mse10
##              [,1]         [,2]         [,3]         [,4]         [,5]
## [1,] 3.218311e+17 3.211233e+17 3.209605e+17 3.217964e+17 3.216730e+17
## [2,] 3.216456e+17 3.218086e+17 3.224292e+17 3.216497e+17 3.215884e+17
## [3,] 3.225961e+17 3.220367e+17 3.212582e+17 3.225169e+17 3.239123e+17
## [4,] 3.219497e+17 3.216228e+17 3.213630e+17 3.211560e+17 3.214825e+17
## [5,] 3.226594e+17 3.226924e+17 3.218692e+17 3.219549e+17 3.228773e+17
##              [,6]         [,7]         [,8]         [,9]        [,10]
## [1,] 3.208841e+17 3.214266e+17 3.213332e+17 3.215641e+17 3.211343e+17
## [2,] 3.219493e+17 3.214436e+17 3.213963e+17 3.225785e+17 3.224823e+17
## [3,] 3.223226e+17 3.217683e+17 3.225869e+17 3.223603e+17 3.212580e+17
## [4,] 3.223637e+17 3.227963e+17 3.216905e+17 3.221527e+17 3.222117e+17
## [5,] 3.224161e+17 3.222716e+17 3.229107e+17 3.235770e+17 3.231393e+17

Wnioski

Nie mieliśmy pecha, po prostu dopasowywanie modeli gaussowskich tutaj nie działa. Prawdopodbnie przewidywanie liczby odsłuchań jest po prostu strasznie trudne.

Bootstrap

Użyjemy metody bootstrap do oszacowania błędów standardowych współczynników regresji liniowej. Podstawową funkcją jest tutaj boot() z pakietu boot. Wymaga ona jako parametru funkcji obliczającej interesującą statystykę dla podanego zbioru danych. Ta ostatnia funkcja powinna akceptować dwa parametry: zbiór danych oraz wektor indeksów (istnieją też inne możliwości: ?boot).

lm_coefs <- function(data, index = 1:nrow(data)) {
  coef(lm(streams ~ liveness, data = dataframe, subset = index))
}

Funkcja lm_coefs() oblicza estymaty współczynników regresji dla zbioru danych typu bootstrap utworzonego z Auto:

n <- nrow(dataframe)
lm_coefs(dataframe, sample(n, n, replace = TRUE))
## (Intercept)    liveness 
## 482955426.4   -842727.7

Oczywiście jednym z takich zbiorów jest sam oryginał

lm_coefs(dataframe)
## (Intercept)    liveness 
##   550517646    -1997345

Obliczenie błędów standardowych metodą bootstrap z 1000 replikacji wygląda następująco.

boot(dataframe, lm_coefs, R = 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = dataframe, statistic = lm_coefs, R = 1000)
## 
## 
## Bootstrap Statistics :
##      original      bias    std. error
## t1* 550517646 -419644.789    29545083
## t2*  -1997345   -6663.118     1147976

Selekcja

Ładowanie zbioru danych

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24

Selekcja cech dla modeli liniowych

dataframe <- na.omit(dataframe)

Metody selekcji cech są zaimplementowane w funkcji regsubsets() z pakietu leaps.

Wybór najepszego podzbioru

dataframe_bs <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, really.big=T)
summary(dataframe_bs)
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name - 
##     released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, 
##     really.big = T)
## 28 Variables  (and intercept)
##                      Forced in Forced out
## artist_count             FALSE      FALSE
## released_year            FALSE      FALSE
## released_month           FALSE      FALSE
## in_spotify_playlists     FALSE      FALSE
## in_spotify_charts        FALSE      FALSE
## in_apple_playlists       FALSE      FALSE
## in_apple_charts          FALSE      FALSE
## in_deezer_charts         FALSE      FALSE
## bpm                      FALSE      FALSE
## keyA                     FALSE      FALSE
## keyA#                    FALSE      FALSE
## keyB                     FALSE      FALSE
## keyC#                    FALSE      FALSE
## keyD                     FALSE      FALSE
## keyD#                    FALSE      FALSE
## keyE                     FALSE      FALSE
## keyF                     FALSE      FALSE
## keyF#                    FALSE      FALSE
## keyG                     FALSE      FALSE
## keyG#                    FALSE      FALSE
## modeMinor                FALSE      FALSE
## danceability             FALSE      FALSE
## valence                  FALSE      FALSE
## energy                   FALSE      FALSE
## acousticness             FALSE      FALSE
## instrumentalness         FALSE      FALSE
## liveness                 FALSE      FALSE
## speechiness              FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          artist_count released_year released_month in_spotify_playlists
## 1  ( 1 ) " "          " "           " "            "*"                 
## 2  ( 1 ) " "          " "           " "            "*"                 
## 3  ( 1 ) " "          " "           " "            "*"                 
## 4  ( 1 ) " "          " "           " "            "*"                 
## 5  ( 1 ) "*"          " "           " "            "*"                 
## 6  ( 1 ) "*"          "*"           " "            "*"                 
## 7  ( 1 ) "*"          "*"           " "            "*"                 
## 8  ( 1 ) "*"          "*"           " "            "*"                 
##          in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1  ( 1 ) " "               " "                " "             " "             
## 2  ( 1 ) " "               "*"                " "             " "             
## 3  ( 1 ) "*"               "*"                " "             " "             
## 4  ( 1 ) "*"               "*"                " "             " "             
## 5  ( 1 ) "*"               "*"                " "             " "             
## 6  ( 1 ) "*"               "*"                " "             " "             
## 7  ( 1 ) "*"               "*"                " "             "*"             
## 8  ( 1 ) "*"               "*"                " "             "*"             
##          bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 2  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 3  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 4  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 5  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 6  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 7  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 8  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
##          modeMinor danceability valence energy acousticness instrumentalness
## 1  ( 1 ) " "       " "          " "     " "    " "          " "             
## 2  ( 1 ) " "       " "          " "     " "    " "          " "             
## 3  ( 1 ) " "       " "          " "     " "    " "          " "             
## 4  ( 1 ) " "       " "          " "     "*"    " "          " "             
## 5  ( 1 ) " "       " "          " "     "*"    " "          " "             
## 6  ( 1 ) " "       " "          " "     "*"    " "          " "             
## 7  ( 1 ) " "       " "          " "     "*"    " "          " "             
## 8  ( 1 ) " "       " "          " "     "*"    " "          " "             
##          liveness speechiness
## 1  ( 1 ) " "      " "        
## 2  ( 1 ) " "      " "        
## 3  ( 1 ) " "      " "        
## 4  ( 1 ) " "      " "        
## 5  ( 1 ) " "      " "        
## 6  ( 1 ) " "      " "        
## 7  ( 1 ) " "      " "        
## 8  ( 1 ) " "      " "

Wnioski

Jak widzimy najistotniejszymi cechami jest to w ilu playlistach znajuje się na różnych platformach. Ważna okazuje się także energia, liczba artystów którzy pracowali przy piosence i co zaskakujące, to czy jest w kluczu G!

Jak można zobaczyć, funkcja regsubsets() domyślnie uwzględnia maksymalnie 8 predyktorów. Jeśli chcemy to zmienić, musimy użyć parametru nvmax.

dataframe_bs <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, nvmax = 19, really.big=T)
dataframe_bs_sum <- summary(dataframe_bs)
dataframe_bs_sum
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name - 
##     released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, 
##     nvmax = 19, really.big = T)
## 28 Variables  (and intercept)
##                      Forced in Forced out
## artist_count             FALSE      FALSE
## released_year            FALSE      FALSE
## released_month           FALSE      FALSE
## in_spotify_playlists     FALSE      FALSE
## in_spotify_charts        FALSE      FALSE
## in_apple_playlists       FALSE      FALSE
## in_apple_charts          FALSE      FALSE
## in_deezer_charts         FALSE      FALSE
## bpm                      FALSE      FALSE
## keyA                     FALSE      FALSE
## keyA#                    FALSE      FALSE
## keyB                     FALSE      FALSE
## keyC#                    FALSE      FALSE
## keyD                     FALSE      FALSE
## keyD#                    FALSE      FALSE
## keyE                     FALSE      FALSE
## keyF                     FALSE      FALSE
## keyF#                    FALSE      FALSE
## keyG                     FALSE      FALSE
## keyG#                    FALSE      FALSE
## modeMinor                FALSE      FALSE
## danceability             FALSE      FALSE
## valence                  FALSE      FALSE
## energy                   FALSE      FALSE
## acousticness             FALSE      FALSE
## instrumentalness         FALSE      FALSE
## liveness                 FALSE      FALSE
## speechiness              FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: exhaustive
##           artist_count released_year released_month in_spotify_playlists
## 1  ( 1 )  " "          " "           " "            "*"                 
## 2  ( 1 )  " "          " "           " "            "*"                 
## 3  ( 1 )  " "          " "           " "            "*"                 
## 4  ( 1 )  " "          " "           " "            "*"                 
## 5  ( 1 )  "*"          " "           " "            "*"                 
## 6  ( 1 )  "*"          "*"           " "            "*"                 
## 7  ( 1 )  "*"          "*"           " "            "*"                 
## 8  ( 1 )  "*"          "*"           " "            "*"                 
## 9  ( 1 )  "*"          "*"           " "            "*"                 
## 10  ( 1 ) "*"          "*"           " "            "*"                 
## 11  ( 1 ) "*"          "*"           "*"            "*"                 
## 12  ( 1 ) "*"          "*"           " "            "*"                 
## 13  ( 1 ) "*"          "*"           "*"            "*"                 
## 14  ( 1 ) "*"          "*"           "*"            "*"                 
## 15  ( 1 ) "*"          "*"           "*"            "*"                 
## 16  ( 1 ) "*"          "*"           "*"            "*"                 
## 17  ( 1 ) "*"          "*"           "*"            "*"                 
## 18  ( 1 ) "*"          "*"           "*"            "*"                 
## 19  ( 1 ) "*"          "*"           "*"            "*"                 
##           in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1  ( 1 )  " "               " "                " "             " "             
## 2  ( 1 )  " "               "*"                " "             " "             
## 3  ( 1 )  "*"               "*"                " "             " "             
## 4  ( 1 )  "*"               "*"                " "             " "             
## 5  ( 1 )  "*"               "*"                " "             " "             
## 6  ( 1 )  "*"               "*"                " "             " "             
## 7  ( 1 )  "*"               "*"                " "             "*"             
## 8  ( 1 )  "*"               "*"                " "             "*"             
## 9  ( 1 )  "*"               "*"                " "             "*"             
## 10  ( 1 ) "*"               "*"                "*"             "*"             
## 11  ( 1 ) "*"               "*"                "*"             "*"             
## 12  ( 1 ) "*"               "*"                "*"             "*"             
## 13  ( 1 ) "*"               "*"                "*"             "*"             
## 14  ( 1 ) "*"               "*"                "*"             "*"             
## 15  ( 1 ) "*"               "*"                "*"             "*"             
## 16  ( 1 ) "*"               "*"                "*"             "*"             
## 17  ( 1 ) "*"               "*"                "*"             "*"             
## 18  ( 1 ) "*"               "*"                "*"             "*"             
## 19  ( 1 ) "*"               "*"                "*"             "*"             
##           bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 2  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 3  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 4  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 5  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 6  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 7  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 8  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 9  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 10  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 11  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 12  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 13  ( 1 ) " " " "  "*"   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 14  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 15  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 16  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  "*"   "*"  " "  
## 17  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  "*"   "*"  " "  
## 18  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  "*"   "*"  " "  
## 19  ( 1 ) " " " "  "*"   " "  " "   "*"  " "   "*"  " "  "*"   "*"  " "  
##           modeMinor danceability valence energy acousticness instrumentalness
## 1  ( 1 )  " "       " "          " "     " "    " "          " "             
## 2  ( 1 )  " "       " "          " "     " "    " "          " "             
## 3  ( 1 )  " "       " "          " "     " "    " "          " "             
## 4  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 5  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 6  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 7  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 8  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 9  ( 1 )  " "       "*"          " "     "*"    " "          " "             
## 10  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 11  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 12  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 13  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 14  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 15  ( 1 ) " "       "*"          " "     "*"    "*"          " "             
## 16  ( 1 ) " "       "*"          " "     "*"    "*"          " "             
## 17  ( 1 ) " "       "*"          " "     "*"    "*"          "*"             
## 18  ( 1 ) " "       "*"          "*"     "*"    "*"          "*"             
## 19  ( 1 ) " "       "*"          "*"     "*"    "*"          "*"             
##           liveness speechiness
## 1  ( 1 )  " "      " "        
## 2  ( 1 )  " "      " "        
## 3  ( 1 )  " "      " "        
## 4  ( 1 )  " "      " "        
## 5  ( 1 )  " "      " "        
## 6  ( 1 )  " "      " "        
## 7  ( 1 )  " "      " "        
## 8  ( 1 )  " "      " "        
## 9  ( 1 )  " "      " "        
## 10  ( 1 ) " "      " "        
## 11  ( 1 ) " "      " "        
## 12  ( 1 ) " "      " "        
## 13  ( 1 ) " "      "*"        
## 14  ( 1 ) " "      "*"        
## 15  ( 1 ) " "      "*"        
## 16  ( 1 ) " "      "*"        
## 17  ( 1 ) " "      "*"        
## 18  ( 1 ) " "      "*"        
## 19  ( 1 ) " "      "*"

Wnioski

Widzimy, że w większości klucz nie ma pwływu na liczbę odsłuchań. Co ciekawe nieistotne jest też czy piosenka była wykonywana na żywo, a takze to czy jest pozytywna czy negatywna (valance)

Obiekt zwracany przez funkcję summary.regsubsets() zawiera informacje umożliwiające zidentyfikowanie globalnie najlepszego pozdbioru cech, np. miarę \(C_p\).

dataframe_bs_sum$cp
##  [1] 381.410414  62.707579  48.272729  32.812311  24.061786  14.880204
##  [7]   8.850438   7.597920   6.910704   6.168598   6.089836   6.078463
## [13]   6.000602   6.205914   6.745193   8.015665   9.375488  10.881186
## [19]  12.397510

Najlepszy podzbiór według kryterium BIC

bic_min <- which.min(dataframe_bs_sum$bic)
bic_min
## [1] 7
dataframe_bs_sum$bic[bic_min]
## [1] -1200.961

Stosowny obrazek

{
plot(dataframe_bs_sum$bic, xlab = "Liczba zmiennych", ylab = "BIC", col = "green",
     type = "b", pch = 20)
points(bic_min, dataframe_bs_sum$bic[bic_min], col = "red", pch = 9)
}

Wnioski

Najlepiej zachować około 7 cech, dokładanie kolejnych pogarasza model.

Dostępny jest też specjalny rodzaj wykresu (?plot.regsubsets).

plot(dataframe_bs, scale = "bic")

Estymaty współczynników dla optymalnego podzbioru

coef(dataframe_bs, id = 6)
##          (Intercept)         artist_count        released_year 
##        -6.185492e+09        -3.673632e+07         3.226564e+06 
## in_spotify_playlists    in_spotify_charts   in_apple_playlists 
##         3.669117e+04         1.944172e+06         2.671990e+06 
##               energy 
##        -2.348066e+06

[Zrób podobną analizę dla innych kryteriów optymalności: \(C_p\) i poprawionego \(R^2\). Zwróć uwagę na to, że poprawione \(R^2\) powinno być zmaksymalizowane.]

Najlepszy podzbiór według kryterium \(C_p\)

cp_min <- which.min(dataframe_bs_sum$cp)
cp_min
## [1] 13
dataframe_bs_sum$cp[cp_min]
## [1] 6.000602

Stosowny obrazek

{
plot(dataframe_bs_sum$cp, xlab = "Liczba zmiennych", ylab = "BIC", col = "green",
     type = "b", pch = 20)
points(cp_min, dataframe_bs_sum$cp[cp_min], col = "red", pch = 9)
}

Dostępny jest też specjalny rodzaj wykresu (?plot.regsubsets).

plot(dataframe_bs, scale = "Cp")

Estymaty współczynników dla optymalnego podzbioru

coef(dataframe_bs, id = 6)
##          (Intercept)         artist_count        released_year 
##        -6.185492e+09        -3.673632e+07         3.226564e+06 
## in_spotify_playlists    in_spotify_charts   in_apple_playlists 
##         3.669117e+04         1.944172e+06         2.671990e+06 
##               energy 
##        -2.348066e+06

Selekcja krokowa do przodu i wstecz

Funkcja regsubsets() z odpowiednio ustawionym parametrem method może przeprowadzić selekcję krokową.

dataframe_fwd <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, nvmax = 19, 
                          method = "forward")
dataframe_fwd_sum <- summary(dataframe_fwd)
dataframe_fwd_sum
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name - 
##     released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, 
##     nvmax = 19, method = "forward")
## 28 Variables  (and intercept)
##                      Forced in Forced out
## artist_count             FALSE      FALSE
## released_year            FALSE      FALSE
## released_month           FALSE      FALSE
## in_spotify_playlists     FALSE      FALSE
## in_spotify_charts        FALSE      FALSE
## in_apple_playlists       FALSE      FALSE
## in_apple_charts          FALSE      FALSE
## in_deezer_charts         FALSE      FALSE
## bpm                      FALSE      FALSE
## keyA                     FALSE      FALSE
## keyA#                    FALSE      FALSE
## keyB                     FALSE      FALSE
## keyC#                    FALSE      FALSE
## keyD                     FALSE      FALSE
## keyD#                    FALSE      FALSE
## keyE                     FALSE      FALSE
## keyF                     FALSE      FALSE
## keyF#                    FALSE      FALSE
## keyG                     FALSE      FALSE
## keyG#                    FALSE      FALSE
## modeMinor                FALSE      FALSE
## danceability             FALSE      FALSE
## valence                  FALSE      FALSE
## energy                   FALSE      FALSE
## acousticness             FALSE      FALSE
## instrumentalness         FALSE      FALSE
## liveness                 FALSE      FALSE
## speechiness              FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           artist_count released_year released_month in_spotify_playlists
## 1  ( 1 )  " "          " "           " "            "*"                 
## 2  ( 1 )  " "          " "           " "            "*"                 
## 3  ( 1 )  " "          " "           " "            "*"                 
## 4  ( 1 )  " "          " "           " "            "*"                 
## 5  ( 1 )  "*"          " "           " "            "*"                 
## 6  ( 1 )  "*"          "*"           " "            "*"                 
## 7  ( 1 )  "*"          "*"           " "            "*"                 
## 8  ( 1 )  "*"          "*"           " "            "*"                 
## 9  ( 1 )  "*"          "*"           " "            "*"                 
## 10  ( 1 ) "*"          "*"           " "            "*"                 
## 11  ( 1 ) "*"          "*"           "*"            "*"                 
## 12  ( 1 ) "*"          "*"           "*"            "*"                 
## 13  ( 1 ) "*"          "*"           "*"            "*"                 
## 14  ( 1 ) "*"          "*"           "*"            "*"                 
## 15  ( 1 ) "*"          "*"           "*"            "*"                 
## 16  ( 1 ) "*"          "*"           "*"            "*"                 
## 17  ( 1 ) "*"          "*"           "*"            "*"                 
## 18  ( 1 ) "*"          "*"           "*"            "*"                 
## 19  ( 1 ) "*"          "*"           "*"            "*"                 
##           in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1  ( 1 )  " "               " "                " "             " "             
## 2  ( 1 )  " "               "*"                " "             " "             
## 3  ( 1 )  "*"               "*"                " "             " "             
## 4  ( 1 )  "*"               "*"                " "             " "             
## 5  ( 1 )  "*"               "*"                " "             " "             
## 6  ( 1 )  "*"               "*"                " "             " "             
## 7  ( 1 )  "*"               "*"                " "             "*"             
## 8  ( 1 )  "*"               "*"                " "             "*"             
## 9  ( 1 )  "*"               "*"                " "             "*"             
## 10  ( 1 ) "*"               "*"                "*"             "*"             
## 11  ( 1 ) "*"               "*"                "*"             "*"             
## 12  ( 1 ) "*"               "*"                "*"             "*"             
## 13  ( 1 ) "*"               "*"                "*"             "*"             
## 14  ( 1 ) "*"               "*"                "*"             "*"             
## 15  ( 1 ) "*"               "*"                "*"             "*"             
## 16  ( 1 ) "*"               "*"                "*"             "*"             
## 17  ( 1 ) "*"               "*"                "*"             "*"             
## 18  ( 1 ) "*"               "*"                "*"             "*"             
## 19  ( 1 ) "*"               "*"                "*"             "*"             
##           bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 2  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 3  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 4  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 5  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 6  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 7  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 8  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 9  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 10  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 11  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 12  ( 1 ) " " " "  "*"   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 13  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 14  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 15  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 16  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  "*"   "*"  " "  
## 17  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  "*"   "*"  " "  
## 18  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  "*"   "*"  " "  
## 19  ( 1 ) " " " "  "*"   " "  " "   "*"  " "   "*"  " "  "*"   "*"  " "  
##           modeMinor danceability valence energy acousticness instrumentalness
## 1  ( 1 )  " "       " "          " "     " "    " "          " "             
## 2  ( 1 )  " "       " "          " "     " "    " "          " "             
## 3  ( 1 )  " "       " "          " "     " "    " "          " "             
## 4  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 5  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 6  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 7  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 8  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 9  ( 1 )  " "       "*"          " "     "*"    " "          " "             
## 10  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 11  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 12  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 13  ( 1 ) " "       "*"          " "     "*"    " "          " "             
## 14  ( 1 ) " "       "*"          " "     "*"    "*"          " "             
## 15  ( 1 ) " "       "*"          " "     "*"    "*"          " "             
## 16  ( 1 ) " "       "*"          " "     "*"    "*"          " "             
## 17  ( 1 ) " "       "*"          " "     "*"    "*"          "*"             
## 18  ( 1 ) " "       "*"          "*"     "*"    "*"          "*"             
## 19  ( 1 ) " "       "*"          "*"     "*"    "*"          "*"             
##           liveness speechiness
## 1  ( 1 )  " "      " "        
## 2  ( 1 )  " "      " "        
## 3  ( 1 )  " "      " "        
## 4  ( 1 )  " "      " "        
## 5  ( 1 )  " "      " "        
## 6  ( 1 )  " "      " "        
## 7  ( 1 )  " "      " "        
## 8  ( 1 )  " "      " "        
## 9  ( 1 )  " "      " "        
## 10  ( 1 ) " "      " "        
## 11  ( 1 ) " "      " "        
## 12  ( 1 ) " "      " "        
## 13  ( 1 ) " "      " "        
## 14  ( 1 ) " "      " "        
## 15  ( 1 ) " "      "*"        
## 16  ( 1 ) " "      "*"        
## 17  ( 1 ) " "      "*"        
## 18  ( 1 ) " "      "*"        
## 19  ( 1 ) " "      "*"
dataframe_back <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, nvmax = 19, 
                           method = "backward")
dataframe_back_sum <- summary(dataframe_back)
dataframe_back_sum
## Subset selection object
## Call: regsubsets.formula(streams ~ . - track_name - artist.s._name - 
##     released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, 
##     nvmax = 19, method = "backward")
## 28 Variables  (and intercept)
##                      Forced in Forced out
## artist_count             FALSE      FALSE
## released_year            FALSE      FALSE
## released_month           FALSE      FALSE
## in_spotify_playlists     FALSE      FALSE
## in_spotify_charts        FALSE      FALSE
## in_apple_playlists       FALSE      FALSE
## in_apple_charts          FALSE      FALSE
## in_deezer_charts         FALSE      FALSE
## bpm                      FALSE      FALSE
## keyA                     FALSE      FALSE
## keyA#                    FALSE      FALSE
## keyB                     FALSE      FALSE
## keyC#                    FALSE      FALSE
## keyD                     FALSE      FALSE
## keyD#                    FALSE      FALSE
## keyE                     FALSE      FALSE
## keyF                     FALSE      FALSE
## keyF#                    FALSE      FALSE
## keyG                     FALSE      FALSE
## keyG#                    FALSE      FALSE
## modeMinor                FALSE      FALSE
## danceability             FALSE      FALSE
## valence                  FALSE      FALSE
## energy                   FALSE      FALSE
## acousticness             FALSE      FALSE
## instrumentalness         FALSE      FALSE
## liveness                 FALSE      FALSE
## speechiness              FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
##           artist_count released_year released_month in_spotify_playlists
## 1  ( 1 )  " "          " "           " "            "*"                 
## 2  ( 1 )  " "          " "           " "            "*"                 
## 3  ( 1 )  " "          " "           " "            "*"                 
## 4  ( 1 )  " "          " "           " "            "*"                 
## 5  ( 1 )  "*"          " "           " "            "*"                 
## 6  ( 1 )  "*"          "*"           " "            "*"                 
## 7  ( 1 )  "*"          "*"           " "            "*"                 
## 8  ( 1 )  "*"          "*"           " "            "*"                 
## 9  ( 1 )  "*"          "*"           " "            "*"                 
## 10  ( 1 ) "*"          "*"           " "            "*"                 
## 11  ( 1 ) "*"          "*"           "*"            "*"                 
## 12  ( 1 ) "*"          "*"           "*"            "*"                 
## 13  ( 1 ) "*"          "*"           "*"            "*"                 
## 14  ( 1 ) "*"          "*"           "*"            "*"                 
## 15  ( 1 ) "*"          "*"           "*"            "*"                 
## 16  ( 1 ) "*"          "*"           "*"            "*"                 
## 17  ( 1 ) "*"          "*"           "*"            "*"                 
## 18  ( 1 ) "*"          "*"           "*"            "*"                 
## 19  ( 1 ) "*"          "*"           "*"            "*"                 
##           in_spotify_charts in_apple_playlists in_apple_charts in_deezer_charts
## 1  ( 1 )  " "               " "                " "             " "             
## 2  ( 1 )  " "               "*"                " "             " "             
## 3  ( 1 )  "*"               "*"                " "             " "             
## 4  ( 1 )  "*"               "*"                " "             " "             
## 5  ( 1 )  "*"               "*"                " "             " "             
## 6  ( 1 )  "*"               "*"                " "             " "             
## 7  ( 1 )  "*"               "*"                " "             "*"             
## 8  ( 1 )  "*"               "*"                " "             "*"             
## 9  ( 1 )  "*"               "*"                " "             "*"             
## 10  ( 1 ) "*"               "*"                "*"             "*"             
## 11  ( 1 ) "*"               "*"                "*"             "*"             
## 12  ( 1 ) "*"               "*"                "*"             "*"             
## 13  ( 1 ) "*"               "*"                "*"             "*"             
## 14  ( 1 ) "*"               "*"                "*"             "*"             
## 15  ( 1 ) "*"               "*"                "*"             "*"             
## 16  ( 1 ) "*"               "*"                "*"             "*"             
## 17  ( 1 ) "*"               "*"                "*"             "*"             
## 18  ( 1 ) "*"               "*"                "*"             "*"             
## 19  ( 1 ) "*"               "*"                "*"             "*"             
##           bpm keyA keyA# keyB keyC# keyD keyD# keyE keyF keyF# keyG keyG#
## 1  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 2  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 3  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 4  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 5  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 6  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 7  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   " "  " "  
## 8  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 9  ( 1 )  " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 10  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 11  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 12  ( 1 ) " " " "  " "   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 13  ( 1 ) " " " "  "*"   " "  " "   " "  " "   " "  " "  " "   "*"  " "  
## 14  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 15  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 16  ( 1 ) " " " "  "*"   " "  " "   " "  " "   "*"  " "  " "   "*"  " "  
## 17  ( 1 ) " " " "  "*"   " "  " "   " "  "*"   "*"  " "  " "   "*"  " "  
## 18  ( 1 ) " " " "  "*"   "*"  " "   " "  "*"   "*"  " "  " "   "*"  " "  
## 19  ( 1 ) " " " "  "*"   "*"  " "   " "  "*"   "*"  "*"  " "   "*"  " "  
##           modeMinor danceability valence energy acousticness instrumentalness
## 1  ( 1 )  " "       " "          " "     " "    " "          " "             
## 2  ( 1 )  " "       " "          " "     " "    " "          " "             
## 3  ( 1 )  " "       " "          " "     " "    " "          " "             
## 4  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 5  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 6  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 7  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 8  ( 1 )  " "       " "          " "     "*"    " "          " "             
## 9  ( 1 )  " "       " "          " "     "*"    "*"          " "             
## 10  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 11  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 12  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 13  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 14  ( 1 ) " "       " "          " "     "*"    "*"          " "             
## 15  ( 1 ) " "       "*"          " "     "*"    "*"          " "             
## 16  ( 1 ) " "       "*"          " "     "*"    "*"          "*"             
## 17  ( 1 ) " "       "*"          " "     "*"    "*"          "*"             
## 18  ( 1 ) " "       "*"          " "     "*"    "*"          "*"             
## 19  ( 1 ) " "       "*"          " "     "*"    "*"          "*"             
##           liveness speechiness
## 1  ( 1 )  " "      " "        
## 2  ( 1 )  " "      " "        
## 3  ( 1 )  " "      " "        
## 4  ( 1 )  " "      " "        
## 5  ( 1 )  " "      " "        
## 6  ( 1 )  " "      " "        
## 7  ( 1 )  " "      " "        
## 8  ( 1 )  " "      " "        
## 9  ( 1 )  " "      " "        
## 10  ( 1 ) " "      " "        
## 11  ( 1 ) " "      " "        
## 12  ( 1 ) " "      "*"        
## 13  ( 1 ) " "      "*"        
## 14  ( 1 ) " "      "*"        
## 15  ( 1 ) " "      "*"        
## 16  ( 1 ) " "      "*"        
## 17  ( 1 ) " "      "*"        
## 18  ( 1 ) " "      "*"        
## 19  ( 1 ) " "      "*"

Wnioski

Wnioski takie same jak przy selekcji nie-krokowwej.

Wybór modelu przy pomocy metody zbioru walidacyjnego

Estymaty błędów testowych będą dokładne tylko jeśli wszystkie aspekty dopasowania modelu — w tym selekcję zmiennych — przeprowadzimy z użyciem wyłącznie zbioru uczącego.

n <- nrow(dataframe)
train <- sample(c(TRUE, FALSE), n, replace = TRUE)
test <- !train
dataframe_bs_v <- regsubsets(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe[train,], nvmax = 19)

Niestety dla modeli zwracanych przez regsubsets nie ma odpowiedniej metody predict(). Może ona mieć następującą postać (funkcja model.matrix() tworzy macierz \(X\) dla podanych punktów).

predict.regsubsets <- function(object, newdata, id, ...) {
  model_formula <- as.formula(object$call[[2]])
  mat <- model.matrix(model_formula, newdata)
  coefs <- coef(object, id = id)
  mat[, names(coefs)] %*% coefs
}

Liczymy estymaty błędów

prediction_error <- function(i, model, subset) {
  pred <- predict(model, dataframe[subset,], id = i)
  mean((dataframe$streams[subset] - pred)^2)
}
val_errors <- sapply(1:19, prediction_error, model = dataframe_bs_v, subset = test)
val_errors
##  [1] 1.580494e+17 1.092670e+17 1.085900e+17 1.068237e+17 1.075376e+17
##  [6] 1.086365e+17 1.079801e+17 1.055658e+17 1.071522e+17 1.073144e+17
## [11] 1.070331e+17 1.092011e+17 1.088255e+17 1.094225e+17 1.095242e+17
## [16] 1.093596e+17 1.100975e+17 1.093578e+17 1.106267e+17

Wnioski

Optymalny model zawiera tylko 4 cechy

Regularyzacja

Ładnowanie danych

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
dataframe <- na.omit(dataframe)

Regularyzacja

Obie omawiane na wykładzie metody regularyzacji są zaimplementowane w funkcji glmnet() z pakietu glmnet.

Funkcja glmnet::glmnet() ma składnię odmienną od lm() i jej podobnych. Dane wejściowe muszą być podane odmiennie. Trzeba w szczególności samodzielnie skonstruować macierz \(\mathbf{X}\)

X <- model.matrix(streams ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe)[, -1]
y <- dataframe$streams

Argument alpha funkcji glmnet() decyduje o typie użytej regularyzacji: 0 oznacza regresję grzbietową, a 1 lasso.

Regresja grzbietowa

Wykonujemy regresję grzbietową dla jawnie określonych wartości \(\lambda\). Podany ciąg \(\lambda\) powinien być malejący. Funkcja glmnet() domyślnie dokonuje standaryzacji zmiennych.

lambda_grid <- 10^seq(10, -2, length.out = 100)
fit_ridge <- glmnet(X, y, alpha = 0, lambda = lambda_grid)

Dla każdej wartości \(\lambda\) otrzymujemy zestaw estymat predyktorów dostępnych w postaci macierzy

dim(coef(fit_ridge))
## [1]  29 100

Można sprawdzić, że większe wartości \(\lambda\) dają mniejszą normę euklidesową współczynników (pomijamy wyraz wolny).

fit_ridge$lambda[50]
## [1] 11497.57
coef_ridge <- coef(fit_ridge)[, 50]
coef_ridge
##          (Intercept)         artist_count        released_year 
##        -6.610929e+09        -3.454073e+07         3.416056e+06 
##       released_month in_spotify_playlists    in_spotify_charts 
##         3.842087e+06         3.565754e+04         3.545401e+06 
##   in_apple_playlists      in_apple_charts     in_deezer_charts 
##         2.929355e+06        -4.342749e+05        -6.295116e+06 
##                  bpm                 keyA                keyA# 
##        -5.486203e+04         2.151831e+06         6.729063e+07 
##                 keyB                keyC#                 keyD 
##         2.289709e+07         5.624426e+05        -1.327940e+07 
##                keyD#                 keyE                 keyF 
##         3.669045e+07         5.070821e+07         2.182324e+07 
##                keyF#                 keyG                keyG# 
##        -2.444262e+07        -4.475443e+07         6.016687e+06 
##            modeMinor         danceability              valence 
##         6.418007e+06        -6.339861e+05        -3.622206e+05 
##               energy         acousticness     instrumentalness 
##        -1.236859e+06         7.331429e+05        -1.016489e+06 
##             liveness          speechiness 
##         3.975903e+05        -1.312226e+06
sqrt(sum(coef_ridge[-1]^2))
## [1] 116581007

Natomiast mniejsze wartości \(\lambda\) dają większą normę euklidesową współczynników

fit_ridge$lambda[70]
## [1] 43.28761
coef(fit_ridge)[, 70]
##          (Intercept)         artist_count        released_year 
##        -6.611587e+09        -3.454144e+07         3.416377e+06 
##       released_month in_spotify_playlists    in_spotify_charts 
##         3.842226e+06         3.565787e+04         3.545692e+06 
##   in_apple_playlists      in_apple_charts     in_deezer_charts 
##         2.929441e+06        -4.343660e+05        -6.296086e+06 
##                  bpm                 keyA                keyA# 
##        -5.486096e+04         2.154622e+06         6.729348e+07 
##                 keyB                keyC#                 keyD 
##         2.289887e+07         5.621571e+05        -1.327873e+07 
##                keyD#                 keyE                 keyF 
##         3.669365e+07         5.071000e+07         2.182568e+07 
##                keyF#                 keyG                keyG# 
##        -2.444270e+07        -4.475287e+07         6.018289e+06 
##            modeMinor         danceability              valence 
##         6.418594e+06        -6.339685e+05        -3.622077e+05 
##               energy         acousticness     instrumentalness 
##        -1.236851e+06         7.331943e+05        -1.016450e+06 
##             liveness          speechiness 
##         3.976559e+05        -1.312262e+06
sqrt(sum(coef(fit_ridge)[-1, 70]^2))
## [1] 116585042

Przy pomocy funkcji predict.glmnet() można uzyskać np. wartości estymat współczynników dla nowej wartości \(\lambda\) (np. 50)

predict(fit_ridge, s = 50, type = "coefficients")
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -6.611587e+09
## artist_count         -3.454144e+07
## released_year         3.416377e+06
## released_month        3.842226e+06
## in_spotify_playlists  3.565787e+04
## in_spotify_charts     3.545692e+06
## in_apple_playlists    2.929440e+06
## in_apple_charts      -4.343659e+05
## in_deezer_charts     -6.296085e+06
## bpm                  -5.486096e+04
## keyA                  2.154620e+06
## keyA#                 6.729348e+07
## keyB                  2.289886e+07
## keyC#                 5.621573e+05
## keyD                 -1.327873e+07
## keyD#                 3.669365e+07
## keyE                  5.071000e+07
## keyF                  2.182568e+07
## keyF#                -2.444270e+07
## keyG                 -4.475287e+07
## keyG#                 6.018288e+06
## modeMinor             6.418593e+06
## danceability         -6.339685e+05
## valence              -3.622077e+05
## energy               -1.236851e+06
## acousticness          7.331943e+05
## instrumentalness     -1.016450e+06
## liveness              3.976558e+05
## speechiness          -1.312262e+06

Estymujemy testowy MSE

set.seed(1)
n <- nrow(X)
train <- sample(n, n / 2)
test <- -train
fit_ridge <- glmnet(X[train,], y[train], alpha = 0, lambda = lambda_grid,
                    thresh = 1e-12)

Dla \(\lambda = 4\)

pred_ridge <- predict(fit_ridge, s = 4, newx = X[test,])
mean((pred_ridge - y[test])^2)
## [1] 7.991473e+16

Testowy MSE dla modelu zerowego (sam wyraz wolny)

pred_null <- mean(y[train])
mean((pred_null - y[test])^2)
## [1] 2.800065e+17

Testowy MSE dla bardzo dużej wartości \(\lambda = 10^{10}\)

pred_ridge_big <- predict(fit_ridge, s = 1e10, newx = X[test,])
mean((pred_ridge_big - y[test])^2)
## [1] 2.363601e+17

[Jak wygląda porównanie?]

Testowy MSE dla \(\lambda = 0\) (metoda najmniejszych kwadratów)

pred_ridge_0 <- predict(fit_ridge, x = X[train,], y = y[train], s = 0, 
                      newx = X[test,], exact = TRUE)
mean((pred_ridge_0 - y[test])^2)
## [1] 7.991473e+16

Porównanie estymat współczynników

lm(y ~ X, subset = train)
## 
## Call:
## lm(formula = y ~ X, subset = train)
## 
## Coefficients:
##           (Intercept)          Xartist_count         Xreleased_year  
##            -7.701e+09             -3.791e+07              3.965e+06  
##       Xreleased_month  Xin_spotify_playlists     Xin_spotify_charts  
##             3.872e+05              3.557e+04              3.706e+06  
##   Xin_apple_playlists       Xin_apple_charts      Xin_deezer_charts  
##             2.931e+06             -2.831e+05             -6.724e+06  
##                  Xbpm                  XkeyA                 XkeyA#  
##            -3.769e+05             -1.865e+07              8.209e+07  
##                 XkeyB                 XkeyC#                  XkeyD  
##             2.711e+06              1.401e+07             -1.708e+07  
##                XkeyD#                  XkeyE                  XkeyF  
##             6.837e+07              1.088e+08             -2.711e+06  
##                XkeyF#                  XkeyG                 XkeyG#  
##             8.864e+06              5.909e+06              4.258e+07  
##            XmodeMinor          Xdanceability               Xvalence  
##             4.315e+07              8.028e+05             -1.345e+06  
##               Xenergy          Xacousticness      Xinstrumentalness  
##            -1.701e+06              1.159e+06             -1.953e+06  
##             Xliveness           Xspeechiness  
##            -3.285e+05             -1.221e+06
predict(fit_ridge, x = X[train,], y = y[train], s = 0, exact = TRUE, 
        type = "coefficients")[1:20,]
##          (Intercept)         artist_count        released_year 
##        -7.701100e+09        -3.791275e+07         3.964665e+06 
##       released_month in_spotify_playlists    in_spotify_charts 
##         3.871685e+05         3.556997e+04         3.706003e+06 
##   in_apple_playlists      in_apple_charts     in_deezer_charts 
##         2.930568e+06        -2.831441e+05        -6.724399e+06 
##                  bpm                 keyA                keyA# 
##        -3.769107e+05        -1.864849e+07         8.208962e+07 
##                 keyB                keyC#                 keyD 
##         2.711360e+06         1.400837e+07        -1.708006e+07 
##                keyD#                 keyE                 keyF 
##         6.837425e+07         1.088185e+08        -2.711139e+06 
##                keyF#                 keyG 
##         8.863846e+06         5.909349e+06

Wyliczenie optymalnej wartości \(\lambda\) przy pomocy walidacji krzyżowej

set.seed(1)
cv_out <- cv.glmnet(X[train,], y[train], alpha = 0)
plot(cv_out)

cv_out$lambda.min
## [1] 47707343

MSE dla optymalnego \(\lambda\)

pred_ridge_opt <- predict(fit_ridge, s = cv_out$lambda.min, newx = X[test,])
mean((pred_ridge_opt - y[test])^2)
## [1] 7.984013e+16

Estymaty współczynników dla optymalnego \(\lambda\)

fit_ridge_full <- glmnet(X, y, alpha = 0)
predict(fit_ridge_full, s = cv_out$lambda.min, type = "coefficients")
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                 s1
## (Intercept)          -4.052075e+09
## artist_count         -3.261450e+07
## released_year         2.164381e+06
## released_month        3.142019e+06
## in_spotify_playlists  3.315303e+04
## in_spotify_charts     2.824766e+06
## in_apple_playlists    2.725871e+06
## in_apple_charts      -1.499768e+05
## in_deezer_charts     -3.733507e+06
## bpm                  -7.593654e+04
## keyA                 -4.495307e+06
## keyA#                 6.159003e+07
## keyB                  1.948485e+07
## keyC#                 4.867638e+06
## keyD                 -1.279375e+07
## keyD#                 3.134895e+07
## keyE                  4.724777e+07
## keyF                  1.675519e+07
## keyF#                -2.081412e+07
## keyG                 -4.632411e+07
## keyG#                 3.476638e+06
## modeMinor             4.111337e+06
## danceability         -7.075421e+05
## valence              -4.306323e+05
## energy               -1.187928e+06
## acousticness          5.700174e+05
## instrumentalness     -1.112572e+06
## liveness              1.838621e+05
## speechiness          -1.174002e+06

Wnioski

Regularyzacja nic nie pomogła.

Lasso

Dopasowujemy lasso dla ustalonej siatki parametrów regularyzacji

fit_lasso <- glmnet(X[train,], y[train], alpha = 1)
plot(fit_lasso, xvar = "lambda")

Wykonujemy walidację krzyżową i liczymy estymatę MSE

cv_out <- cv.glmnet(X[train,], y[train], alpha = 1)
plot(cv_out)

cv_out$lambda.min
## [1] 15262853
pred_lasso <- predict(fit_lasso, s = cv_out$lambda.min, newx = X[test,])
mean((pred_lasso - y[test])^2)
## [1] 7.652867e+16

[Jak wygląda porównanie z modelem zerowym, metodą najmniejszych kwadratów i regresją grzbietową?]

Estymaty współczynników dla optymalnego \(\lambda\)

fit_lasso_full <- glmnet(X, y, alpha = 1)
predict(fit_lasso_full, s = cv_out$lambda.min, type = "coefficients")[1:20,]
##          (Intercept)         artist_count        released_year 
##        -1.338756e+09        -2.156827e+07         7.895307e+05 
##       released_month in_spotify_playlists    in_spotify_charts 
##         0.000000e+00         3.394806e+04         1.415269e+06 
##   in_apple_playlists      in_apple_charts     in_deezer_charts 
##         2.639222e+06         0.000000e+00         0.000000e+00 
##                  bpm                 keyA                keyA# 
##         0.000000e+00         0.000000e+00         1.192251e+06 
##                 keyB                keyC#                 keyD 
##         0.000000e+00         0.000000e+00         0.000000e+00 
##                keyD#                 keyE                 keyF 
##         0.000000e+00         7.476343e+05         0.000000e+00 
##                keyF#                 keyG 
##         0.000000e+00        -1.254109e+07

Modele nieliniowe

Ładowanie zbioru danych

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, streams = as.numeric(streams))
## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
## by coercion
dataframe = na.omit(dataframe)

head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24
library(splines)
library(gam)
## Loading required package: foreach
## Loaded gam 1.22-3

Modele nieliniowe

Regresja wielomianowa

fit_poly <- lm(streams ~ poly(energy, 4), data = dataframe)
summary(fit_poly)
## 
## Call:
## lm(formula = streams ~ poly(energy, 4), data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -525902947 -372594811 -224310311  152574110 3198091315 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       514137425   18385120  27.965   <2e-16 ***
## poly(energy, 4)1 -455403405  567263618  -0.803    0.422    
## poly(energy, 4)2 -494261264  567263618  -0.871    0.384    
## poly(energy, 4)3  345346985  567263618   0.609    0.543    
## poly(energy, 4)4 -526866803  567263618  -0.929    0.353    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 567300000 on 947 degrees of freedom
## Multiple R-squared:  0.002777,   Adjusted R-squared:  -0.001435 
## F-statistic: 0.6592 on 4 and 947 DF,  p-value: 0.6204

To samo z użyciem standardowej bazy wielomianów \(X, X^2, X^3, X^4\).

fit_poly_raw <- lm(streams ~ poly(energy, 4, raw = TRUE), data = dataframe)
summary(fit_poly_raw)
## 
## Call:
## lm(formula = streams ~ poly(energy, 4, raw = TRUE), data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -525902947 -372594811 -224310311  152574110 3198091315 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  -4.413e+08  7.947e+08  -0.555    0.579
## poly(energy, 4, raw = TRUE)1  7.654e+07  6.652e+07   1.150    0.250
## poly(energy, 4, raw = TRUE)2 -2.086e+06  1.962e+06  -1.063    0.288
## poly(energy, 4, raw = TRUE)3  2.388e+04  2.422e+04   0.986    0.325
## poly(energy, 4, raw = TRUE)4 -9.893e+01  1.065e+02  -0.929    0.353
## 
## Residual standard error: 567300000 on 947 degrees of freedom
## Multiple R-squared:  0.002777,   Adjusted R-squared:  -0.001435 
## F-statistic: 0.6592 on 4 and 947 DF,  p-value: 0.6204

To samo, co powyżej, inaczej zapisane

fit_poly_raw <- lm(streams ~ energy + I(energy^2) + I(energy^3) + I(energy^4), data = dataframe)
summary(fit_poly_raw)
## 
## Call:
## lm(formula = streams ~ energy + I(energy^2) + I(energy^3) + I(energy^4), 
##     data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -525902947 -372594811 -224310311  152574110 3198091315 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.413e+08  7.947e+08  -0.555    0.579
## energy       7.654e+07  6.652e+07   1.150    0.250
## I(energy^2) -2.086e+06  1.962e+06  -1.063    0.288
## I(energy^3)  2.388e+04  2.422e+04   0.986    0.325
## I(energy^4) -9.893e+01  1.065e+02  -0.929    0.353
## 
## Residual standard error: 567300000 on 947 degrees of freedom
## Multiple R-squared:  0.002777,   Adjusted R-squared:  -0.001435 
## F-statistic: 0.6592 on 4 and 947 DF,  p-value: 0.6204

Obrazek dopasowania zawierający krzywe błędu standardowego.

energy_lims <- range(dataframe$energy)
energy_grid <- seq(energy_lims[1], energy_lims[2])
pred_poly <- predict(fit_poly, list(energy = energy_grid), se.fit = TRUE)
se_bands <- cbind(pred_poly$fit + 2 * pred_poly$se.fit, 
                  pred_poly$fit - 2 * pred_poly$se.fit)
{
plot(dataframe$energy, dataframe$streams, col = "darkgrey", cex = 0.5, xlim = energy_lims)
lines(energy_grid, pred_poly$fit, col = "red", lwd = 2)
matlines(energy_grid, se_bands, col = "red", lty = "dashed")
}

Wnioski

Ponownie mmierzymy się z tym, że liczba odsłuchań jest trudna do przewidzenia.

Regresja logistyczna wielomianowa

Chcemy skonstruować klasyfikator z dwoma klasami: popularne piosenki (więcej niż 10^9 odsłuchań) i mniej popularne

fit_log_poly <- glm(I(streams > 1000000000) ~ poly(energy, 4), data = dataframe, family = binomial)

Funkcja predict.glm() standardowo zwraca szanse logarytmiczne, co jest korzystne z punktu widzenia zobrazowania błędu standardowego. Musimy jednak otrzymane wartości przekształcić funkcją logistyczną.

pred_log_poly <- predict(fit_log_poly, list(energy = energy_grid), se.fit = TRUE)
pred_probs <- plogis(pred_log_poly$fit)
se_bands_logit <- cbind(pred_log_poly$fit + 2 * pred_log_poly$se.fit,
                        pred_log_poly$fit - 2 * pred_log_poly$se.fit)
se_bands <- plogis(se_bands_logit)
plot(dataframe$energy, I(dataframe$streams > 250), xlim = energy_lims, ylim = c(0, 1), 
     col = "darkgrey", cex = 0.5, ylab = "P(streams > 10^9 | energy)")
lines(energy_grid, pred_probs, col = "red", lwd = 2)
matlines(energy_grid, se_bands, lty = "dashed", col = "red")

Wnioski

Największą szanse na bycie popularnymi mają piosenki albo o stosunkowo niskiej energii (około 30) albo wysokiejh (około 85).

Funkcje schodkowe

Dopasowanie funkcji schodkowej wykonujemy przy pomocy funkcji cut() przekształcającej zmienną numeryczną w czynnik uporządkowany.

table(cut(dataframe$energy, breaks = 4))
## 
## (8.91,31]   (31,53]   (53,75] (75,97.1] 
##        35       206       438       273

Samo dopasowanie wykonuje funkcja lm().

fit_step <- lm(streams ~ cut(energy, 4), data = dataframe)
pred_step <- predict(fit_step, list(energy = energy_grid), se.fit = TRUE)
se_bands <- cbind(pred_step$fit + 2 * pred_step$se.fit, 
                  pred_step$fit - 2 * pred_step$se.fit)
plot(dataframe$energy, dataframe$streams, col = "darkgrey", cex = 0.5, xlim = energy_lims)
lines(energy_grid, pred_step$fit, col = "red", lwd = 2)
matlines(energy_grid, se_bands, col = "red", lty = "dashed")

Funkcje sklejane

Bazę regresyjnych funkcji sklejanych wylicza funkcja bs() z pakietu splines. Domyślnym stopniem funkcji sklejanych jest 3.

Regresja z użyciem funkcji sklejanych z ustalonymi węzłami.

fit_bs_knots <- lm(streams ~ bs(energy, knots = c(25, 40, 60)), data = dataframe)
pred_bs_knots <- predict(fit_bs_knots, list(energy = energy_grid), se.fit = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(energy_grid, pred_bs_knots$fit, col = "red", lwd = 2)
lines(energy_grid, pred_bs_knots$fit + 2 * pred_bs_knots$se.fit, col = "red",
      lty = "dashed")
lines(energy_grid, pred_bs_knots$fit - 2 * pred_bs_knots$se.fit, col = "red",
      lty = "dashed")
abline(v = c(25, 40, 60), lty = "dotted")

[Sprawdź jak ustawienie węzłów wpływa na dopasowany model.]

Dopasowanie modelu wykorzystującego funkcje sklejane o ustalonej liczbie stopni swobody. Węzły są rozmieszczane automatycznie.

fit_bs_dataframe <- lm(streams ~ bs(energy, df = 6), data = dataframe)
pred_bs_dataframe <- predict(fit_bs_dataframe, list(energy = energy_grid), se.fit = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(energy_grid, pred_bs_dataframe$fit, col = "red", lwd = 2)
lines(energy_grid, pred_bs_dataframe$fit + 2 * pred_bs_dataframe$se.fit, col = "red",
      lty = "dashed")
lines(energy_grid, pred_bs_dataframe$fit - 2 * pred_bs_dataframe$se.fit, col = "red",
      lty = "dashed")
bs_knots <- attr(bs(dataframe$energy, df = 6), "knots")
abline(v = bs_knots, lty = "dotted")

[Sprawdź jak liczba stopni swobody wpływa na dopasowany model.]

[Funkcja bs() akceptuje parametr degree, który ustala stopień funkcji sklejanej. Sprawdź jak w powyższych przykładach wyglądają funkcje sklejane innych stopni.]

Naturalne funkcje sklejane

Bazę naturalnych sześciennych funkcji sklejanych wyznacza funkcja ns() z pakietu splines.

fit_ns <- lm(streams ~ ns(energy, df = 4), data = dataframe)
pred_ns <- predict(fit_ns, list(energy = energy_grid), se.fit = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(energy_grid, pred_ns$fit, col = "red", lwd = 2)
lines(energy_grid, pred_ns$fit + 2 * pred_ns$se.fit, col = "red",
      lty = "dashed")
lines(energy_grid, pred_ns$fit - 2 * pred_ns$se.fit, col = "red",
      lty = "dashed")
abline(v = attr(ns(dataframe$energy, df = 4), "knots"), lty = "dotted")

Dopasowanie wygładzającej (sześciennej) funkcji sklejanej do danych wykonuje funkcja smooth.spline(). Możemy dopasować wygładzającą funkcję sklejaną o ustalonej liczbie stopni swobody (tu 16).

fit_smooth_dataframe <- smooth.spline(dataframe$energy, dataframe$streams, df = 16)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(fit_smooth_dataframe, col = "red", lwd = 2)

Można też liczbę stopni swobody wyznaczyć automatycznie korzystając z walidacji krzyżowej.

fit_smooth_cv <- smooth.spline(dataframe$energy, dataframe$streams, cv = TRUE)
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
lines(fit_smooth_cv, col = "red", lwd = 2)

Regresja lokalna

Regresję lokalną (domyślnie wielomianami stopnia 2) wykonuje funkcja loess(). Parametr funkcji o nazwie span odpowiada parametrowi metody \(s\).

spans <- c(0.2, 0.5)
clrs <- c("red", "blue")
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
for (i in 1:length(spans)) {
   fit_loess <- loess(streams ~ energy, span = spans[i], data = dataframe)
   pred_loess <- predict(fit_loess, data.frame(energy = energy_grid))
   lines(energy_grid, pred_loess, col = clrs[i], lwd = 2)
}
legend("topright", legend = paste("s =", spans), col = clrs, lty = 1, lwd = 2)

To samo dla wielomianów stopnia 1.

spans <- c(0.2, 0.5)
clrs <- c("red", "blue")
plot(dataframe$energy, dataframe$streams, cex = 0.5, col = "darkgrey")
for (i in 1:length(spans)) {
   fit_loess <- loess(streams ~ energy, span = spans[i], degree = 1, data = dataframe)
   pred_loess <- predict(fit_loess, data.frame(energy = energy_grid))
   lines(energy_grid, pred_loess, col = clrs[i], lwd = 2)
}
legend("topright", legend = paste("s =", spans), col = clrs, lty = 1, lwd = 2)

Uogólnione modele addytywne (GAMs)

GAM będący rozwinięciem modelu liniowego może być uczony metodą najmniejszych kwadratów przy pomocy funkcji lm().

fit_gam_ls <- lm(streams ~ ns(liveness, df = 4) + ns(energy, df = 5) + speechiness,
                 data = dataframe)
fit_gam_ls
## 
## Call:
## lm(formula = streams ~ ns(liveness, df = 4) + ns(energy, df = 5) + 
##     speechiness, data = dataframe)
## 
## Coefficients:
##           (Intercept)  ns(liveness, df = 4)1  ns(liveness, df = 4)2  
##             490613650             -185679764               18880355  
## ns(liveness, df = 4)3  ns(liveness, df = 4)4    ns(energy, df = 5)1  
##            -309513355             -408063207              192513613  
##   ns(energy, df = 5)2    ns(energy, df = 5)3    ns(energy, df = 5)4  
##             229884824               64003008              450311130  
##   ns(energy, df = 5)5            speechiness  
##             -36553235               -6864911
summary(fit_gam_ls)
## 
## Call:
## lm(formula = streams ~ ns(liveness, df = 4) + ns(energy, df = 5) + 
##     speechiness, data = dataframe)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -591343398 -363975051 -205534747  140592740 3155335490 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            490613650  252430028   1.944 0.052246 .  
## ns(liveness, df = 4)1 -185679764  128958908  -1.440 0.150246    
## ns(liveness, df = 4)2   18880356  139889184   0.135 0.892667    
## ns(liveness, df = 4)3 -309513355  292197500  -1.059 0.289753    
## ns(liveness, df = 4)4 -408063207  232934448  -1.752 0.080128 .  
## ns(energy, df = 5)1    192513614  197400036   0.975 0.329689    
## ns(energy, df = 5)2    229884824  234761136   0.979 0.327719    
## ns(energy, df = 5)3     64003008  143395251   0.446 0.655455    
## ns(energy, df = 5)4    450311130  476840107   0.944 0.345226    
## ns(energy, df = 5)5    -36553236  154269130  -0.237 0.812751    
## speechiness             -6864911    1856933  -3.697 0.000231 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 563900000 on 941 degrees of freedom
## Multiple R-squared:  0.02098,    Adjusted R-squared:  0.01058 
## F-statistic: 2.017 on 10 and 941 DF,  p-value: 0.02889

Ogólniejsze GAM są uczone przy pomocy algorytmu dopasowania wstecznego w funkcji gam() z pakietu gam. Pakiet gam zawiera też funkcje implementujące modele nieparametryczne: s() reprezentującą wygładzające funkcje sklejane i lo() reprezentującą lokalną regresję.

Dopasowanie modelu podobnego do poprzedniego, ale z użyciem wygładzających funkcji sklejanych.

fit_gam_bf <- gam(streams ~ s(liveness, df = 4) + s(energy, df = 5) + speechiness, data = dataframe)
summary(fit_gam_bf)
## 
## Call: gam(formula = streams ~ s(liveness, df = 4) + s(energy, df = 5) + 
##     speechiness, data = dataframe)
## Deviance Residuals:
##        Min         1Q     Median         3Q        Max 
## -577432754 -364782396 -205612599  138098435 3157061286 
## 
## (Dispersion Parameter for gaussian family taken to be 3.176346e+17)
## 
##     Null Deviance: 3.055818e+20 on 951 degrees of freedom
## Residual Deviance: 2.988941e+20 on 940.9999 degrees of freedom
## AIC: 41079.89 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                      Df     Sum Sq    Mean Sq F value    Pr(>F)    
## s(liveness, df = 4)   1 6.8757e+17 6.8757e+17  2.1646 0.1415516    
## s(energy, df = 5)     1 1.6909e+17 1.6909e+17  0.5324 0.4658003    
## speechiness           1 4.3065e+18 4.3065e+18 13.5582 0.0002444 ***
## Residuals           941 2.9889e+20 3.1763e+17                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                     Npar Df  Npar F  Pr(F)
## (Intercept)                               
## s(liveness, df = 4)       3 0.93479 0.4232
## s(energy, df = 5)         4 0.78633 0.5341
## speechiness

Wykres dla modelu dopasowanego funkcją gam().

par(mfrow = c(1, 3))
plot(fit_gam_bf, col = "red", se = TRUE)

Funkcja plot.Gam() działa też dla modeli metody najmniejszych kwadratów, ale wówczas trzeba się do niej odwołać jawnie.

par(mfrow = c(1, 3))
plot.Gam(fit_gam_ls, col = "red", se = TRUE)

Istnieje wersja funkcji anova() porównująca GAMs.

fit_gam_1 <- gam(streams ~ s(energy, df = 5) + speechiness, data = dataframe)
fit_gam_2 <- gam(streams ~ liveness + s(energy, df = 5) + speechiness, data = dataframe)
anova(fit_gam_1, fit_gam_2, fit_gam_bf, test = "F")
## Analysis of Deviance Table
## 
## Model 1: streams ~ s(energy, df = 5) + speechiness
## Model 2: streams ~ liveness + s(energy, df = 5) + speechiness
## Model 3: streams ~ s(liveness, df = 4) + s(energy, df = 5) + speechiness
##   Resid. Df Resid. Dev     Df   Deviance      F Pr(>F)
## 1       945 3.0046e+20                                
## 2       944 2.9978e+20 1.0000 6.7786e+17 2.1341 0.1444
## 3       941 2.9889e+20 3.0001 8.8608e+17 0.9298 0.4257

Dopasowanie modelu wykorzystującego lokalną regresję.

fit_gam_lo <- gam(streams ~ s(liveness, df = 4) + lo(energy, span = 0.7) + speechiness, 
                  data = dataframe)
summary(fit_gam_lo)
## 
## Call: gam(formula = streams ~ s(liveness, df = 4) + lo(energy, span = 0.7) + 
##     speechiness, data = dataframe)
## Deviance Residuals:
##        Min         1Q     Median         3Q        Max 
## -578148127 -365712192 -203669165  140780193 3161423534 
## 
## (Dispersion Parameter for gaussian family taken to be 3.173674e+17)
## 
##     Null Deviance: 3.055818e+20 on 951 degrees of freedom
## Residual Deviance: 2.995249e+20 on 943.7797 degrees of freedom
## AIC: 41076.34 
## 
## Number of Local Scoring Iterations: 1 
## 
## Anova for Parametric Effects
##                            Df     Sum Sq    Mean Sq F value    Pr(>F)    
## s(liveness, df = 4)      1.00 6.9971e+17 6.9971e+17  2.2047 0.1379209    
## lo(energy, span = 0.7)   1.00 1.6834e+17 1.6834e+17  0.5304 0.4666070    
## speechiness              1.00 4.2103e+18 4.2103e+18 13.2663 0.0002849 ***
## Residuals              943.78 2.9952e+20 3.1737e+17                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F  Pr(F)
## (Intercept)                                  
## s(liveness, df = 4)        3.0 0.95347 0.4141
## lo(energy, span = 0.7)     1.2 0.94078 0.3501
## speechiness
par(mfrow = c(1, 3))
plot(fit_gam_lo, col = "green", se = TRUE)

GAM w GLM

Regresja logistyczna wykorzystująca GAM

fit_logistic_gam <- gam(I(streams > 1000000000) ~ liveness + s(energy, df = 5) + speechiness, 
                        family = binomial, data = dataframe)
summary(fit_logistic_gam)
## 
## Call: gam(formula = I(streams > 1e+09) ~ liveness + s(energy, df = 5) + 
##     speechiness, family = binomial, data = dataframe)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7059 -0.6277 -0.5900 -0.4728  2.2727 
## 
## (Dispersion Parameter for binomial family taken to be 1)
## 
##     Null Deviance: 836.0694 on 951 degrees of freedom
## Residual Deviance: 825.9861 on 944 degrees of freedom
## AIC: 841.9862 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df Sum Sq Mean Sq F value  Pr(>F)  
## liveness            1   0.74  0.7377  0.7345 0.39165  
## s(energy, df = 5)   1   0.10  0.1032  0.1027 0.74865  
## speechiness         1   6.11  6.1085  6.0820 0.01383 *
## Residuals         944 948.11  1.0044                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar Chisq P(Chi)
## (Intercept)                                
## liveness                                   
## s(energy, df = 5)       4     2.1698 0.7046
## speechiness
par(mfrow = c(1, 3))
plot(fit_logistic_gam, col = "blue", se = TRUE)

Wnioski

Modele sprawdziły się słabo, trzeba było wybrać inny dataset albo przewidywać inne rzeczy :(

Drzewa

Ładowanie zbioru danych

dataframe = read.csv("spotify-2023.csv")
names(dataframe)[names(dataframe) == "danceability_."] <- "danceability"
names(dataframe)[names(dataframe) == "valence_."] <- "valence"
names(dataframe)[names(dataframe) == "energy_."] <- "energy"
names(dataframe)[names(dataframe) == "acousticness_."] <- "acousticness"
names(dataframe)[names(dataframe) == "instrumentalness_."] <- "instrumentalness"
names(dataframe)[names(dataframe) == "liveness_."] <- "liveness"
names(dataframe)[names(dataframe) == "speechiness_."] <- "speechiness"

dataframe = transform(dataframe, liveness = as.numeric(liveness))
dataframe = na.omit(dataframe)

head(dataframe)
##                            track_name    artist.s._name artist_count
## 1 Seven (feat. Latto) (Explicit Ver.)  Latto, Jung Kook            2
## 2                                LALA       Myke Towers            1
## 3                             vampire    Olivia Rodrigo            1
## 4                        Cruel Summer      Taylor Swift            1
## 5                      WHERE SHE GOES         Bad Bunny            1
## 6                            Sprinter Dave, Central Cee            2
##   released_year released_month released_day in_spotify_playlists
## 1          2023              7           14                  553
## 2          2023              3           23                 1474
## 3          2023              6           30                 1397
## 4          2019              8           23                 7858
## 5          2023              5           18                 3133
## 6          2023              6            1                 2186
##   in_spotify_charts   streams in_apple_playlists in_apple_charts
## 1               147 141381703                 43             263
## 2                48 133716286                 48             126
## 3               113 140003974                 94             207
## 4               100 800840817                116             207
## 5                50 303236322                 84             133
## 6                91 183706234                 67             213
##   in_deezer_playlists in_deezer_charts in_shazam_charts bpm key  mode
## 1                  45               10              826 125   B Major
## 2                  58               14              382  92  C# Major
## 3                  91               14              949 138   F Major
## 4                 125               12              548 170   A Major
## 5                  87               15              425 144   A Minor
## 6                  88               17              946 141  C# Major
##   danceability valence energy acousticness instrumentalness liveness
## 1           80      89     83           31                0        8
## 2           71      61     74            7                0       10
## 3           51      32     53           17                0       31
## 4           55      58     72           11                0       11
## 5           65      23     80           14               63       11
## 6           92      66     58           19                0        8
##   speechiness
## 1           4
## 2           4
## 3           6
## 4          15
## 5           6
## 6          24
library(tree)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(gbm)
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

Drzewa decyzyjne

Drzewa decyzyjne są zaimplementowane w pakiecie tree (nieco odmienna implementacja dostępna jest w pakiecie rpart).

Drzewa klasyfikacyjne

Będziemy klasyfikować obserwacje do dwóch klas: wykonywane na zywo i nie wykonywane na żywo. Uzupełniamy zbiór danych

live <- factor(ifelse(dataframe$liveness <= 50, "No", "Yes"))
dataframeLive <- data.frame(dataframe, live)

Budujemy drzewo klasyfikacyjne do predykcji live na podstawie pozostałych zmiennych (poza liveness i kilku innych).

liveness_high_tree <- tree(live ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - liveness, data = dataframeLive)
## Warning in tree(live ~ . - track_name - artist.s._name - released_day - : NAs
## introduced by coercion
summary(liveness_high_tree)
## 
## Classification tree:
## tree(formula = live ~ . - track_name - artist.s._name - released_day - 
##     in_deezer_playlists - in_shazam_charts - liveness, data = dataframeLive)
## Variables actually used in tree construction:
##  [1] "energy"               "in_spotify_charts"    "bpm"                 
##  [4] "valence"              "in_apple_playlists"   "streams"             
##  [7] "in_apple_charts"      "danceability"         "acousticness"        
## [10] "artist_count"         "released_year"        "speechiness"         
## [13] "released_month"       "in_spotify_playlists"
## Number of terminal nodes:  28 
## Residual mean deviance:  0.09891 = 91.49 / 925 
## Misclassification error rate: 0.02623 = 25 / 953

Przedstawienie graficzne dopasowanego modelu

{
plot(liveness_high_tree)
text(liveness_high_tree, pretty = 0)
}

Winoski

Widzimy, że zdecydowanie łatwiej zaklasyfikować piosenkę jako niewykonywaną na żywo. Albo piosneek z dużym liveness (<50) jest mało. W każdym razie drzewo znalazło pewien wąski zakres pozostałych paramterów który pozwala wykryć takie piosenki.

Więcej informacji podaje funkcja print.tree()

liveness_high_tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##     1) root 953 273.400 No ( 0.967471 0.032529 )  
##       2) energy < 60.5 364  34.770 No ( 0.991758 0.008242 )  
##         4) in_spotify_charts < 34.5 332  13.610 No ( 0.996988 0.003012 )  
##           8) bpm < 83.5 33   8.962 No ( 0.969697 0.030303 )  
##            16) bpm < 81.5 28   0.000 No ( 1.000000 0.000000 ) *
##            17) bpm > 81.5 5   5.004 No ( 0.800000 0.200000 ) *
##           9) bpm > 83.5 299   0.000 No ( 1.000000 0.000000 ) *
##         5) in_spotify_charts > 34.5 32  14.960 No ( 0.937500 0.062500 )  
##          10) valence < 37 9   9.535 No ( 0.777778 0.222222 ) *
##          11) valence > 37 23   0.000 No ( 1.000000 0.000000 ) *
##       3) energy > 60.5 589 225.200 No ( 0.952462 0.047538 )  
##         6) in_apple_playlists < 75.5 400 192.400 No ( 0.935000 0.065000 )  
##          12) bpm < 159 360 186.700 No ( 0.927778 0.072222 )  
##            24) bpm < 153.5 346 158.400 No ( 0.939306 0.060694 )  
##              48) streams < 1.12543e+09 332 139.900 No ( 0.945783 0.054217 )  
##                96) in_apple_charts < 0.5 35   0.000 No ( 1.000000 0.000000 ) *
##                97) in_apple_charts > 0.5 297 135.800 No ( 0.939394 0.060606 )  
##                 194) danceability < 91.5 291 123.900 No ( 0.945017 0.054983 )  
##                   388) in_spotify_charts < 2.5 137  81.360 No ( 0.912409 0.087591 )  
##                     776) acousticness < 54 130  65.430 No ( 0.930769 0.069231 )  
##                      1552) in_apple_charts < 21.5 82  56.740 No ( 0.890244 0.109756 )  
##                        3104) bpm < 126.5 63  24.120 No ( 0.952381 0.047619 )  
##                          6208) bpm < 93 21  17.220 No ( 0.857143 0.142857 )  
##                           12416) artist_count < 1.5 12   0.000 No ( 1.000000 0.000000 ) *
##                           12417) artist_count > 1.5 9  11.460 No ( 0.666667 0.333333 ) *
##                          6209) bpm > 93 42   0.000 No ( 1.000000 0.000000 ) *
##                        3105) bpm > 126.5 19  23.700 No ( 0.684211 0.315789 )  
##                          6210) released_year < 2022.5 13  17.940 No ( 0.538462 0.461538 )  
##                           12420) in_apple_charts < 5.5 6   5.407 No ( 0.833333 0.166667 ) *
##                           12421) in_apple_charts > 5.5 7   8.376 Yes ( 0.285714 0.714286 ) *
##                          6211) released_year > 2022.5 6   0.000 No ( 1.000000 0.000000 ) *
##                      1553) in_apple_charts > 21.5 48   0.000 No ( 1.000000 0.000000 ) *
##                     777) acousticness > 54 7   9.561 No ( 0.571429 0.428571 ) *
##                   389) in_spotify_charts > 2.5 154  37.100 No ( 0.974026 0.025974 )  
##                     778) speechiness < 4.5 54  28.520 No ( 0.925926 0.074074 )  
##                      1556) in_apple_playlists < 26.5 33   0.000 No ( 1.000000 0.000000 ) *
##                      1557) in_apple_playlists > 26.5 21  20.450 No ( 0.809524 0.190476 )  
##                        3114) released_month < 4.5 8   0.000 No ( 1.000000 0.000000 ) *
##                        3115) released_month > 4.5 13  16.050 No ( 0.692308 0.307692 )  
##                          6230) in_apple_playlists < 50 8  11.090 No ( 0.500000 0.500000 ) *
##                          6231) in_apple_playlists > 50 5   0.000 No ( 1.000000 0.000000 ) *
##                     779) speechiness > 4.5 100   0.000 No ( 1.000000 0.000000 ) *
##                 195) danceability > 91.5 6   7.638 No ( 0.666667 0.333333 ) *
##              49) streams > 1.12543e+09 13  14.050 No ( 0.769231 0.230769 )  
##                98) valence < 45.5 6   8.318 No ( 0.500000 0.500000 ) *
##                99) valence > 45.5 7   0.000 No ( 1.000000 0.000000 ) *
##            25) bpm > 153.5 14  18.250 No ( 0.642857 0.357143 )  
##              50) energy < 81 7   0.000 No ( 1.000000 0.000000 ) *
##              51) energy > 81 7   8.376 Yes ( 0.285714 0.714286 ) *
##          13) bpm > 159 40   0.000 No ( 1.000000 0.000000 ) *
##         7) in_apple_playlists > 75.5 189  22.170 No ( 0.989418 0.010582 )  
##          14) released_month < 11.5 178   0.000 No ( 1.000000 0.000000 ) *
##          15) released_month > 11.5 11  10.430 No ( 0.818182 0.181818 )  
##            30) in_spotify_playlists < 8983.5 6   0.000 No ( 1.000000 0.000000 ) *
##            31) in_spotify_playlists > 8983.5 5   6.730 No ( 0.600000 0.400000 ) *

Wnioski

Istotne są energia, in_apple_playlists i bpm.

Metodą zbioru walidacyjnego estymujemy błąd testowy dla drzewa klasyfikacyjnego w rozważanym problemie.

set.seed(1)
n <- nrow(dataframeLive)
train <- sample(n, n / 2)
test <- -train
liveness_high_tree <- tree(live ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - liveness, data = dataframeLive, subset = train)
## Warning in tree(live ~ . - track_name - artist.s._name - released_day - : NAs
## introduced by coercion
tree_class <- predict(liveness_high_tree, newdata = dataframeLive[test,], type = "class")
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
table(tree_class, dataframeLive$live[test])
##           
## tree_class  No Yes
##        No  457  17
##        Yes   2   1
mean(tree_class != dataframeLive$live[test])
## [1] 0.03983229

Duże drzewo \(T_0\) dla zbioru uczącego dataframeLive[train,]

{
plot(liveness_high_tree)
text(liveness_high_tree, pretty = 0)
}

Do znalezienia optymalnego poddrzewa stosujemy przycinanie stosowane złożonością. Przy pomocy CV konstruujemy ciąg poddrzew wyznaczony przez malejącą złożoność.

set.seed(1)
liveness_high_cv <- cv.tree(liveness_high_tree, FUN = prune.misclass)
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
liveness_high_cv
## $size
## [1] 14  7  1
## 
## $dev
## [1] 24 24 16
## 
## $k
## [1] -Inf  0.0  0.5
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(liveness_high_cv$size, liveness_high_cv$dev, type = "b")

Składowa liveness_high_cv$dev zawiera liczbę błędów CV. Przycinamy drzewo \(T_0\) do poddrzewa z najmniejszym poziomem błędów CV.

size_opt <- liveness_high_cv$size[which.min(liveness_high_cv$dev)]
# niestety size_opt = 1 i nie da się tego wyświetlić
liveness_high_pruned <- prune.misclass(liveness_high_tree, best = 5)
{
plot(liveness_high_pruned)
text(liveness_high_pruned, pretty = 0)
}

Wnioski

Optymalne drzewo miało tylko jeden poziom i z jakiegoś powodu nie mogło zostać narysowane, rysuję wiec takie. Zgadza się to z intuicją co do istotnych cech.

Testowy poziom błędów dla optymalnego poddrzewa.

pruned_class <- predict(liveness_high_pruned, newdata = dataframeLive[test,], 
                        type = "class")
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
table(pruned_class, dataframeLive$live[test])
##             
## pruned_class  No Yes
##          No  457  17
##          Yes   2   1
mean(pruned_class != dataframeLive$live[test])
## [1] 0.03983229

Drzewa regresyjne

liveness_tree <- tree(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe)
## Warning in tree(liveness ~ . - track_name - artist.s._name - released_day - :
## NAs introduced by coercion
summary(liveness_tree)
## 
## Regression tree:
## tree(formula = liveness ~ . - track_name - artist.s._name - released_day - 
##     in_deezer_playlists - in_shazam_charts, data = dataframe)
## Variables actually used in tree construction:
## [1] "energy" "bpm"   
## Number of terminal nodes:  3 
## Residual mean deviance:  177.7 = 168900 / 950 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -47.750  -8.759  -5.759   0.000   5.241  79.240

Deviance oznacza tutaj RSS. Przedstawienie drzewa

{
liveness_tree
plot(liveness_tree)
text(liveness_tree)
}

Wnioski

Ponownie istotna jest energia i BPM, ma sens pod kątem wystąpień na żywo.

Metodą zbioru walidacyjnego szacujemy błąd testowy.

set.seed(1)
n <- nrow(dataframe)
train <- sample(n, n / 2)
test <- -train
liveness_tree <- tree(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train)
## Warning in tree(liveness ~ . - track_name - artist.s._name - released_day - :
## NAs introduced by coercion
liveness_pred <- predict(liveness_tree, newdata = dataframe[test,])
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
mean((liveness_pred - dataframe$liveness[test])^2)
## [1] 230.8634
{
plot(liveness_tree)
text(liveness_tree)
}

Wyznaczamy optymalne poddrzewo metodą przycinania sterowanego złożonością.

liveness_cv <- cv.tree(liveness_tree)
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by
## coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
plot(liveness_cv$size, liveness_cv$dev, type = "b")

Wnioski

Najlepiej działa drzewo o rozmiarze 3.

Przycinanie drzewa \(T_0\) do żądanego poziomu realizuje w tym przypadku funkcja prune.tree().

liveness_pruned <- prune.tree(liveness_tree, best = 4)
plot(liveness_pruned)
text(liveness_pruned)

[Oblicz estymatę błędu testowego dla poddrzewa z 4 liśćmi.]

Bagging i lasy losowe

Bagging i lasy losowe implementowane są przez pakiet randomForest. Oczywiście bagging jest szczególnym przypadkiem lasu losowego.

Bagging

Bagging dla regresji liveness względem wszystkich pozostałych w zbiorze dataframe.

liveness_bag <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, mtry = 13, importance = TRUE)
liveness_bag
## 
## Call:
##  randomForest(formula = liveness ~ . - track_name - artist.s._name -      released_day - in_deezer_playlists - in_shazam_charts, data = dataframe,      mtry = 13, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 197.1739
##                     % Var explained: -4.99

Wykres błędu OOB względem liczby drzew

plot(liveness_bag, type = "l")

Wnioski

Dodawanie więcej niż 100/200 drzew nie poprawia już wyniku.

W przypadku regresji błąd MSE OOB dostępny jest w składowej mse obiektu klasy randomForest. W przypadku klasyfikacji wyniki analizy danych OOB dostępne są w składowych err.rate (proporcja błędów) i confusion (tabela pomyłek).

Wyznaczenie ważności predyktorów

importance(liveness_bag)
##                          %IncMSE IncNodePurity
## artist_count          5.67845283     4269.1055
## released_year         4.50259927     5008.6306
## released_month        1.09900812     7015.6625
## in_spotify_playlists  7.88154381    16357.8599
## in_spotify_charts     6.00499571     6282.4111
## streams              -0.91095539    12706.5470
## in_apple_playlists    6.74829503    10859.3255
## in_apple_charts       2.38574303     9361.9722
## in_deezer_charts      3.67634998     4039.2470
## bpm                   5.01592523    14941.9088
## key                  -0.00522467     6909.8455
## mode                  0.94644763     1613.6739
## danceability          6.62912780    13196.8077
## valence               3.68052570    11567.1685
## energy               19.15120059    17393.8035
## acousticness         15.28882140    16927.8176
## instrumentalness     -1.35875602      670.8521
## speechiness           0.40185385     7738.9384

I stosowny obrazek

varImpPlot(liveness_bag)

Wnioski

Widzimy, że zdecydowanie najważneijsza jest energia i akustyczność. Potem także liczba playlist do których należy piosenka. Nie liczy się klucz czy instrumentalność.

Oszacowanie błędu testowego dla poprzednio wyznaczonego zbioru walidacyjnego.

set.seed(2)
liveness_bag <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train, mtry = 13,
                         importance = TRUE)
liveness_pred_bag <- predict(liveness_bag, newdata = dataframe[test,])
mean((liveness_pred_bag - dataframe$liveness[test])^2)
## [1] 200.766
varImpPlot(liveness_bag)

Wnioski

Cechy delikatnie się zmieniły.

Powyższe dla mniejszej liczby hodowanych drzew

set.seed(2)
liveness_bag_s <- randomForest(liveness ~ ., data = dataframe, subset = train, mtry = 13,
                         importance = TRUE, ntree = 25)
liveness_pred_bag_s <- predict(liveness_bag_s, newdata = dataframe[test,])
mean((liveness_pred_bag_s - dataframe$liveness[test])^2)
## [1] 207.2359

Lasy losowe

Domyślna wartość parametru mtry to \(\sqrt{p}\) dla regresji i \(p/3\) dla klasyfikacji.

Oszacowanie błędu testowego dla poprzednio wyznaczonego zbioru walidacyjnego.

set.seed(2)
liveness_rf <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train,
                         importance = TRUE)
liveness_pred_rf <- predict(liveness_rf, newdata = dataframe[test,])
mean((liveness_pred_rf - dataframe$liveness[test])^2)
## [1] 198.8062

Wnioski

Wynik nieco lepszy niż przy baggingu

Powyższe dla ręcznie ustawionego parametru \(m\) (czyli mtry).

set.seed(2)
liveness_rf <- randomForest(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts, data = dataframe, subset = train, mtry = 6,
                         importance = TRUE)
liveness_pred_rf <- predict(liveness_rf, newdata = dataframe[test,])
mean((liveness_pred_rf - dataframe$liveness[test])^2)
## [1] 198.8062

Boosting

Używamy algorytmów boostingu dla drzew decyzyjnych zaimplementowanych w pakiecie gbm. Inną implementację — wydajną i często pojawiającą się w zastosowaniach — zawiera pakiet xgboost.

Boosting dla regresji liveness względem pozostałych zmiennych ze zbioru dataframe. Funkcją dopasowującą model jest gbm() z istotnymi parametrami:

  • distribution: "gaussian" dla regresji z RSS, "bernoulli" dla regresji typu logistycznego;

  • n.trees: liczba hodowanych drzew (\(B\));

  • interaction.depth: głębokość interakcji (\(d\));

  • shrinkage: parametr spowalniający uczenie (\(\lambda\)).

liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe, distribution = "gaussian",
                  n.trees = 5000, interaction.depth = 4)
liveness_boost
## gbm(formula = liveness ~ . - track_name - artist.s._name - released_day - 
##     in_deezer_playlists - in_shazam_charts - streams - key - 
##     mode, distribution = "gaussian", data = dataframe, n.trees = 5000, 
##     interaction.depth = 4)
## A gradient boosted model with gaussian loss function.
## 5000 iterations were performed.
## There were 15 predictors of which 15 had non-zero influence.

Funkcja summary.gbm() wyznacza ważność predyktorów i (domyślnie) wykonuje odpowiedni wykres.

summary(liveness_boost)

##                                       var   rel.inf
## in_spotify_playlists in_spotify_playlists 12.084400
## energy                             energy 10.999656
## bpm                                   bpm  9.768195
## valence                           valence  9.478973
## danceability                 danceability  9.103650
## acousticness                 acousticness  8.980394
## in_apple_playlists     in_apple_playlists  8.670737
## in_apple_charts           in_apple_charts  7.806939
## speechiness                   speechiness  5.316668
## in_spotify_charts       in_spotify_charts  5.109875
## released_month             released_month  4.364729
## released_year               released_year  3.110387
## in_deezer_charts         in_deezer_charts  2.509551
## artist_count                 artist_count  1.874445
## instrumentalness         instrumentalness  0.821399

Wnioski

Pokrywa się ze wcześniejszymi odkryciami, liczba wystąpień na playlistach Spotify, energia i bpm najważneiszjymi cechami.

Funkcja plot.gbm() wykonuje wykresy częściowej zaleźności.

plot(liveness_boost, i.var = "acousticness")

plot(liveness_boost, i.var = "energy")

plot(liveness_boost, i.var = c("acousticness", "energy"))

Oszacowanie błędu testowego dla poprzednio wyznaczonego zbioru walidacyjnego.

set.seed(2)
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train,], distribution = "gaussian",
                  interaction.depth = 4, n.trees = 5000)
liveness_pred_boost <- predict(liveness_boost, newdata = dataframe[test,], n.trees = 5000)
mean((liveness_pred_boost - dataframe$liveness[test])^2)
## [1] 259.4762

To samo dla \(\lambda = 0.01\).

set.seed(2)
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train,], distribution = "gaussian",
                  interaction.depth = 4, n.trees = 5000, shrinkage = 0.01)
liveness_pred_boost <- predict(liveness_boost, newdata = dataframe[test,], n.trees = 5000)
mean((liveness_pred_boost - dataframe$liveness[test])^2)
## [1] 223.2554

To samo dla \(d = 1\).

set.seed(2)
liveness_boost <- gbm(liveness ~ . - track_name - artist.s._name - released_day - in_deezer_playlists - in_shazam_charts - streams - key - mode, data = dataframe[train,], distribution = "gaussian",
                  n.trees = 5000, shrinkage = 0.01)
liveness_pred_boost <- predict(liveness_boost, newdata = dataframe[test,], n.trees = 5000)
mean((liveness_pred_boost - dataframe$liveness[test])^2)
## [1] 207.1783

Wnioski

Wyniki nieco gorsze niż dla prostych lasów losowych.